Keyword Analysis with PostgreSQL: Algorithms for Text Analysis with the Smlar Plug-in

Background

Example 1

"I want a dog"  // 狗  
"I want a chihuahua" // 吉娃娃狗
"dog|chihuahua"

Example 2

PostgreSQL Databases Do Not Use IDF to Calculate the Rank by Default

Within the ts_rank function, there is no native way to rank results by their global (corpus) frequency. The ranking algorithm does, however, rank based on term frequency within the document:

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html

So if I search for "dog|chihuahua", the following two documents would have the same rank, despite the relatively lower corpus frequency of the word "chihuahua":

"I want a dog" // 狗
"I want a chihuahua" // 奇瓦瓦狗

However, the following line would be ranked higher than the two lines above, because it contains the stemmed token "dog" twice:

"dog lovers have an average of 1.5 dogs"

In short: a higher term frequency (TF) within the document results in a higher rank, but a lower term frequency across the corpus has no impact.

One caveat: full-text search does ignore stop words, so you will not match ultra-high-frequency words like "the", "a", "of", "for", etc. (assuming you have correctly set your language).
In short, Postgres does not use TF-IDF as a similarity measure among documents.

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.
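The difference is easy to see in a toy model. The sketch below is illustrative Python, not PostgreSQL's actual ranking code: `tf_score` stands in for the TF-only behavior described above, while `tfidf_score` adds the corpus-frequency weighting that ts_rank lacks.

```python
import math

# Toy corpus; "dogs" is shown already stemmed to "dog".
docs = [
    ["i", "want", "a", "dog"],
    ["i", "want", "a", "chihuahua"],
    ["dog", "lovers", "have", "an", "average", "of", "1.5", "dog"],
]
query = ["dog", "chihuahua"]

def tf_score(doc, terms):
    # TF only -- roughly what ts_rank considers: within-document frequency.
    return sum(doc.count(t) for t in terms)

def tfidf_score(doc, terms, corpus):
    # TF-IDF: rare terms ("chihuahua") are weighted up by log(N / doc_freq).
    n = len(corpus)
    score = 0.0
    for t in terms:
        df = sum(1 for d in corpus if t in d)
        if df:
            score += doc.count(t) * math.log(n / df)
    return score

# TF only: the first two documents tie, and the third wins ("dog" twice).
# TF-IDF: the "chihuahua" document ranks highest, because the term is rare.
```

Under TF-only scoring the first two documents are indistinguishable; only the TF-IDF variant rewards the document containing the rarer query term.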

There is an extension from the text search creators called smlar, that lets you calculate the similarity between arrays using TF-IDF.

It also lets you turn tsvectors into arrays, and supports fast indexing.

Introduction to the smlar Plug-in

Function API

float4 smlar(anyarray, anyarray)  
- computes the similarity of two arrays. Arrays should be of the same type.
float4 smlar(anyarray, anyarray, bool useIntersect)  
- computes the similarity of two arrays of composite type. The composite type looks like:
CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);
The useIntersect option restricts the denominator to intersected elements only;
see the examples in sql/composite_int4.sql and sql/composite_text.sql.
postgres=# create type tp as (c1 text, c2 float4);
CREATE TYPE
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp);
smlar
----------
0.816497
(1 row)
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp,true);
smlar
-------
1
(1 row)
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.2), ('中国',1.1),('china',1)]::_tp,true);
smlar
----------
0.999822
(1 row)
The result is independent of element order:
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('中国',1.1),('你好',2.2),('china',1)]::_tp,true);
smlar
----------
0.999822
(1 row)
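The session outputs above are consistent with a weighted cosine similarity whose denominator norms are computed over only the intersected elements when useIntersect is true. The Python sketch below is my reading of those numbers, not smlar's C source:

```python
import math

def weighted_cosine_intersect(a, b):
    """Cosine similarity of two weighted element sets (dicts of
    element -> weight), using only the intersected elements in the
    denominator, i.e. the useIntersect=true behavior."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    na = sum(a[k] ** 2 for k in common)
    nb = sum(b[k] ** 2 for k in common)
    return dot / math.sqrt(na * nb)

a = {"你好": 2.1, "中国": 1.1}
b = {"你好": 2.2, "中国": 1.1, "china": 1.0}
print(round(weighted_cosine_intersect(a, b), 6))  # 0.999822, as in the session
```

With identical weights on the intersected elements the result is exactly 1, matching the second query above.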
Note: GiST indexes do not support the composite (weighted) type:
elog(ERROR, "GiST  doesn't support composite (weighted) type");
float4 smlar( anyarray a, anyarray b, text formula );  
- computes the similarity of two arrays by the given formula; arrays should
be of the same type.
Predefined variables in the formula:
N.i - number of common elements in both arrays (intersection)
N.a - number of unique elements in the first array
N.b - number of unique elements in the second array
Example:
smlar('{1,4,6}'::int[], '{5,4,6}' )
smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' )
These two calls are equivalent.
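In Python terms, the default cosine set-similarity formula amounts to the following (a sketch using the N.i/N.a/N.b definitions above):

```python
import math

def smlar_formula(a, b):
    """Default formula: N.i / sqrt(N.a * N.b)."""
    ni = len(set(a) & set(b))   # N.i: common elements
    na = len(set(a))            # N.a: unique elements in the first array
    nb = len(set(b))            # N.b: unique elements in the second array
    return ni / math.sqrt(na * nb)

print(round(smlar_formula([1, 4, 6], [5, 4, 6]), 6))  # 2 / sqrt(9) = 0.666667
```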

Operator API

anyarray % anyarray  
- returns true if the similarity of the two arrays is greater than the limit

float4 show_smlar_limit() - deprecated
- shows the limit for the % operation

float4 set_smlar_limit(float4) - deprecated
- sets the limit for the % operation

Use the GUC variable smlar.threshold (see below) instead of
show_smlar_limit/set_smlar_limit.

Conversion Function API

text[] tsvector2textarray(tsvector)  
- transforms a tsvector into a text array
anyarray array_unique(anyarray)  
- sorts and de-duplicates an array
float4 inarray(anyarray, anyelement)  
- returns 0 if the second argument is not present in the first one,
and 1.0 otherwise
float4 inarray(anyarray, anyelement, float4, float4)  
- returns the fourth argument if the second argument is not present in
the first one, and the third argument otherwise
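A Python paraphrase of the documented semantics of array_unique and the two inarray variants:

```python
def array_unique(arr):
    # Sort and de-duplicate, like smlar's array_unique().
    return sorted(set(arr))

def inarray(arr, elem, present=1.0, absent=0.0):
    # Two-argument form: returns 1.0 if elem is in arr, else 0.0.
    # Four-argument form: returns `present` if found, else `absent`.
    return present if elem in arr else absent

print(array_unique([3, 1, 2, 1]))  # [1, 2, 3]
```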

Parameter

smlar.threshold  FLOAT  
Arrays with similarity lower than the threshold are not considered similar
by the % operation
smlar.persistent_cache BOOL  
The cache of global statistics is stored in transaction-independent memory
smlar.type  STRING  
Type of similarity formula: cosine (default), tfidf, or overlap
The corresponding switch in smlar's C source:

    switch(getSmlType())
    {
        case ST_TFIDF:
            PG_RETURN_FLOAT4( TFIDFSml(sa, sb) );
            break;
        case ST_COSINE:
        {
            int    cnt;
            double power;

            power = ((double)(sa->nelems)) * ((double)(sb->nelems));
            cnt = numOfIntersect(sa, sb);

            PG_RETURN_FLOAT4( ((double)cnt) / sqrt( power ) );
        }
        break;
        case ST_OVERLAP:
        {
            float4 res = (float4)numOfIntersect(sa, sb);

            PG_RETURN_FLOAT4(res);
        }
        break;
        default:
            elog(ERROR,"Unsupported formula type of similarity");
    }
smlar.stattable    STRING  
Name of the table that stores set-wide statistics. The table should be
defined as:

CREATE TABLE table_name (
value data_type UNIQUE, -- the term
ndoc int4 (or bigint) NOT NULL CHECK (ndoc>0) -- number of documents containing the term
);

A row with a NULL value stores the total number of documents.
See the examples in the sql/*g.sql files.

Note: used only for smlar.type = 'tfidf'
smlar.tf_method STRING  
Calculation method for term frequency. Values:
"n" - simple counting of entries (default)
"log" - 1 + log(n)
"const" - TF is equal to 1
Note: used only for smlar.type = 'tfidf'
smlar.idf_plus_one BOOL  
If false (default), IDF is calculated as log(d/df);
if true, as log(1 + d/df)
Note: used only for smlar.type = 'tfidf'
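The tf_method values and the idf_plus_one switch can be sketched as follows (illustrative Python based on the descriptions above, not smlar's C code):

```python
import math

def tf(n, method="n"):
    # smlar.tf_method: "n" (raw count), "log" (1 + log(n)), "const" (1)
    if method == "n":
        return float(n)
    if method == "log":
        return 1.0 + math.log(n)
    return 1.0  # "const"

def idf(total_docs, doc_freq, plus_one=False):
    # smlar.idf_plus_one: log(d/df) by default, log(1 + d/df) if true
    ratio = total_docs / doc_freq
    return math.log(1.0 + ratio) if plus_one else math.log(ratio)
```

The "log" method dampens the effect of very frequent terms within a document, while idf_plus_one keeps the IDF positive even for terms that appear in every document.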

Recommended Parameter Configuration

The module provides several GUC variables, such as smlar.threshold. It is highly  
recommended to add them to postgresql.conf:

custom_variable_classes = 'smlar' # list of custom variable class names
smlar.threshold = 0.6 # or any other value > 0 and < 1

along with any other smlar.* variables you need.

Index OPS Class

GiST/GIN support for  %  and  &&  operations for:  
Array Type | GIN operator class | GiST operator class
---------------+----------------------+----------------------
bit[] | _bit_sml_ops |
bytea[] | _bytea_sml_ops | _bytea_sml_ops
char[] | _char_sml_ops | _char_sml_ops
cidr[] | _cidr_sml_ops | _cidr_sml_ops
date[] | _date_sml_ops | _date_sml_ops
float4[] | _float4_sml_ops | _float4_sml_ops
float8[] | _float8_sml_ops | _float8_sml_ops
inet[] | _inet_sml_ops | _inet_sml_ops
int2[] | _int2_sml_ops | _int2_sml_ops
int4[] | _int4_sml_ops | _int4_sml_ops
int8[] | _int8_sml_ops | _int8_sml_ops
interval[] | _interval_sml_ops | _interval_sml_ops
macaddr[] | _macaddr_sml_ops | _macaddr_sml_ops
money[] | _money_sml_ops |
numeric[] | _numeric_sml_ops | _numeric_sml_ops
oid[] | _oid_sml_ops | _oid_sml_ops
text[] | _text_sml_ops | _text_sml_ops
time[] | _time_sml_ops | _time_sml_ops
timestamp[] | _timestamp_sml_ops | _timestamp_sml_ops
timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops
timetz[] | _timetz_sml_ops | _timetz_sml_ops
varbit[] | _varbit_sml_ops |
varchar[] | _varchar_sml_ops | _varchar_sml_ops

How to Calculate Similarity in TF-IDF Mode by Using smlar: An Example

/* see http://blog.databasepatterns.com/2014/07/postgresql-install-smlar-extension.html */  
create extension if not exists smlar;
/* your basic table to search against: */  
create table documents (
document_id int primary key,
body text not null
);
/*   
I created 100,000 "lorem ipsum" documents here http://www.mockaroo.com/c5418bd0
In retrospect, not a great choice due to the small number of unique words used to generate the dataset
*/
copy documents
from program 'curl "http://www.mockaroo.com/c5418bd0/download?count=100000&key=5b15a410"'
with (format csv, header true);
/* or load the same data from a local copy of the file: */
copy documents
from '/home/digoal/test.csv'
with (format csv, header true);
/* this table holds document frequencies (# of docs in which a term appears) for the documents.body column: */  
create table documents_body_stats (
value text unique,
ndoc int not null
);
/* used ts_stat for convenience; not ideal, but good for a quick-and-dirty setup: */  
insert into documents_body_stats
select
word,
ndoc
from
ts_stat( 'select to_tsvector(''simple'', body) from documents' );
/* the smlar tfidf table needs the total document count as well. It's added as a row with null in the value column: */  
insert into documents_body_stats values
(null, (select count(*) from documents) );
/* turn documents into array of words. you could also use tsvector2textarray( to_tsvector(...) ) : */  
create or replace function tokenize(text) returns text[] language sql strict immutable as $$
select regexp_split_to_array( lower($1), '[^[:alnum:]]' );
$$;
postgres=# select tsvector2textarray(to_tsvector('i am digoal digoal')),  to_tsvector('i am digoal digoal');  
tsvector2textarray | to_tsvector
--------------------+--------------
{digoal} | 'digoal':3,4
(1 row)
/* use smlar's text array opclass. gist is a little more flexible than gin in this case (allows 'n' tf_method): */  
create index on documents using gist ( tokenize(body) _text_sml_ops ); --24 seconds
/* setup smlar: */  
set smlar.type = 'tfidf';
set smlar.stattable = 'documents_body_stats';
set smlar.tf_method = 'n';
set smlar.threshold = 0.4;
/* the query */  
select
*,
smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )
from
documents
where
tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold
order by
smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc
limit 10;


postgres=# explain (analyze,verbose,timing,costs,buffers) select
*,
smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )
from
documents
where
tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold
order by
smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc
limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))
Buffers: shared hit=79
-> Sort (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))
Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=79
-> Index Scan using documents_tokenize_idx on public.documents (cost=0.14..2.17 rows=1 width=237) (actual time=7.641..7.641 rows=0 loops=1)
Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])
Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])
Rows Removed by Index Recheck: 61
Buffers: shared hit=79
Planning time: 0.148 ms
Execution time: 7.703 ms
(15 rows)
postgres=# set smlar.threshold = 0.8;  
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select
*,
smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )
from
documents
where
tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold
order by
smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc
limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2.18..2.18 rows=1 width=237) (actual time=1.051..1.051 rows=0 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))
Buffers: shared hit=29
-> Sort (cost=2.18..2.18 rows=1 width=237) (actual time=1.049..1.049 rows=0 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))
Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=29
-> Index Scan using documents_tokenize_idx on public.documents (cost=0.14..2.17 rows=1 width=237) (actual time=1.042..1.042 rows=0 loops=1)
Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])
Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])
Rows Removed by Index Recheck: 1
Buffers: shared hit=29
Planning time: 0.190 ms
Execution time: 1.113 ms
(15 rows)

If you do not need to sort the results by similarity, you can also remove the ORDER BY clause to improve performance.

IDF of MADlib Training Text

  1. Often there is an initial first-level processing step to handle tokenization, word stemming, stop-word filtering, etc. There is no right or wrong way to handle this step, and there is a fair amount of art to doing it well.
  2. Next, build feature vectors for each document. There are a variety of methods for handling this; one common example is to build tf-idf vectors. For this, you need metrics such as each term's frequency within a document, the number of documents containing each term, and the total number of documents.
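The second step can be sketched in Python. This is a generic tf-idf feature-vector builder under the standard definitions (tf = raw count, idf = log(N/df)), not MADlib's implementation:

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    # corpus: list of tokenized documents (the output of step 1).
    n = len(corpus)
    # df: number of documents containing each term.
    df = Counter(t for doc in corpus for t in set(doc))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in corpus
    ]

corpus = [["i", "want", "a", "dog"], ["i", "want", "a", "chihuahua"]]
vecs = tfidf_vectors(corpus)
# Terms shared by every document get idf = log(1) = 0; only "dog" and
# "chihuahua" carry weight here.
```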

Alibaba Cloud

Follow me to keep abreast of the latest technology news, industry insights, and developer trends. Alibaba Cloud website: https://www.alibabacloud.com