Keyword Analysis with PostgreSQL: Cosine and Linear Correlation Algorithms for Text Analysis

Background

Many websites provide tag functionality and automatically generate tags from the content of a web page. Put simply, tags are the keywords of a page; for example, a page selling mobile phones carries tags for mobile phone categories and features. So how are these tags generated?

Term Frequency-Inverse Document Frequency Algorithm

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical method for evaluating how important a word or term is to a document in a collection or corpus. The importance of a word increases with the number of times it appears in the document, but is offset by how frequently it appears across the whole corpus.

  • IDF is a weight measuring the universal importance of a word: the more common a word is, the lower its IDF.
  • To obtain the IDF of a word, divide the total number of documents by the number of documents that contain the word, and take the logarithm of the quotient.
  • To obtain the importance of a word in a document, multiply its TF by its IDF. The TF-IDF algorithm thus filters out common words and retains important ones.

Examples

Example 1

Many mathematical formulas can be used to calculate the TF-IDF value. This example uses the standard formulation shown below.
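The standard definitions, consistent with Example 2 below, are:

TF = (number of occurrences of the term in the document) / (total number of terms in the document)

IDF = log(total number of documents in the corpus / number of documents that contain the term)

TF-IDF = TF × IDF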

Example 2

On a web page with 1,000 words, the terms “atomic energy”, “of”, and “application” occur 2, 35, and 5 times, respectively. Their TF values are therefore 0.002, 0.035, and 0.005.
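These TF values are easy to reproduce in SQL. A minimal sketch, using a values list to stand in for the hypothetical counts from this example:

select word, occurrences::numeric / 1000 as tf
from (values ('atomic energy', 2), ('of', 35), ('application', 5)) t(word, occurrences);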

Lexicon

Lexicon and IDF

Lexicons are used in input methods. For example, people in the IT industry use IT-specific terms to communicate with their colleagues. The phrases in a lexicon can be used for word segmentation. The IDF of a word should differ between lexicons for different industries because the underlying document collections differ. For example, suppose there are 10 billion documents for the IT industry and 1 billion documents for the medical industry. Would the IDF of “biology” be the same in both lexicons?
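It would not. A quick sketch with purely hypothetical document counts (say “biology” occurs in 10 million of the 10 billion IT documents but in 200 million of the 1 billion medical documents) shows how far the IDFs can diverge:

select log(10000000000.0 / 10000000) as idf_it,      -- 3
       log(1000000000.0 / 200000000) as idf_medical; -- ~0.7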

IDF Calculation

As with a census, you can either sample or compute across the entire network to calculate the IDF value. Search engines crawl content from all kinds of websites every day, so the volume of content they hold is enormous, and IDF values calculated from that content are accurate. However, to extract accurate keywords within a specific vertical, industry-specific IDF values still need to be calculated.

MADlib TF Measurement API

MADlib is an open-source machine learning library developed by Pivotal and the University of California, Berkeley. It can be used with Greenplum, PostgreSQL, HAWQ, and other databases.

term_frequency(input_table,
               doc_id_col,
               word_col,
               output_table,
               compute_vocab)

Arguments:

input_table
TEXT.
The name of the table storing the documents.
Each row is in the form <doc_id, word_vector> where doc_id is an id, unique to each document, and word_vector is a text array containing the words in the document.
The word_vector should contain multiple entries of a word if the document contains multiple occurrences of that word.

doc_id_col
TEXT.
The name of the column containing the document id.

word_col
TEXT.
The name of the column containing the vector of words/terms in the document.
This column should be of a type that can be cast to TEXT[], for example tsvector.

output_table
TEXT.
The name of the table to store the term frequency output.
The output table contains the following columns:
id_col:
This is the document id column (the same one provided as input).
word:
A word/term present in a document. This is either the original word from word_col or an id representing the word (depending on the value of compute_vocab below).
count:
The number of times this word is found in the document.
compute_vocab:
BOOLEAN. (Optional, Default=FALSE) Flag to indicate whether a vocabulary is to be created.
If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word.
The table is called output_table_vocabulary (the suffix is added to the output_table name) and contains the following columns:
wordid:
An id assigned to each word.
word:
The word/term.
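A minimal usage sketch. The table and column names below are illustrative, and MADlib is assumed to be installed under the madlib schema:

create table documents (docid int, words text[]);
insert into documents values
  (1, string_to_array('hi i am digoal', ' ')),
  (2, string_to_array('hi i am abc', ' '));
-- writes per-document term counts to documents_tf and,
-- because compute_vocab=true, a documents_tf_vocabulary table
select madlib.term_frequency('documents', 'docid', 'words', 'documents_tf', true);
select * from documents_tf order by docid;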

PostgreSQL Full-Text Statistics Collection

PostgreSQL supports full-text retrieval (tsvector). You can use the ts_stat function to calculate the total number of occurrences of a word and the number of documents that contain the word in a table.

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.    

ts_stat(sqlquery text, [ weights text, ]
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record

sqlquery is a text value containing an SQL query which must return a single tsvector column.

ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data.

The columns returned are

word text — the value of a lexeme

ndoc integer — number of documents (tsvectors) the word occurred in

nentry integer — total number of occurrences of the word

If weights is supplied, only occurrences having one of those weights are counted.

For example, to find the ten most frequent words in a document collection:
SELECT * FROM ts_stat('SELECT vector FROM apod')    
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
postgres=# SELECT * from (values (tsvector 'a b c'),(to_tsvector('b c d b'))) t(word);
        word
---------------------
 'a' 'b' 'c'
 'b':1,4 'c':2 'd':3
(2 rows)

postgres=# SELECT * FROM ts_stat($$SELECT * from (values (tsvector 'a b c'),(to_tsvector('b c d b'))) t(word)$$);
 word | ndoc | nentry
------+------+--------
 d    |    1 |      1
 c    |    2 |      2
 b    |    2 |      3
 a    |    1 |      1
(4 rows)
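When the weights argument is supplied, only lexemes carrying one of the requested weights are counted. A small sketch, using setweight to label every lexeme with weight A before collecting statistics:

postgres=# SELECT * FROM ts_stat($$SELECT setweight(to_tsvector('b c d b'), 'A')$$, 'a');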

How to Train and Generate IDFs in a Database

Calculate the IDFs of all words, including stop words. First, create a test table in which each record contains a primary key and a document.

create table doc(id int primary key, info text);    

postgres=# insert into doc values (1,'hi i am digoal');
INSERT 0 1

postgres=# insert into doc values (2,'hi i am abc');
INSERT 0 1
postgres=# select * from ts_debug('a b c hello i am digoal');
   alias   |   description   | token  |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+--------+----------------+--------------+----------
 asciiword | Word, all ASCII | a      | {english_stem} | english_stem | {}
 blank     | Space symbols   |        | {}             |              |
 asciiword | Word, all ASCII | b      | {english_stem} | english_stem | {b}
 blank     | Space symbols   |        | {}             |              |
 asciiword | Word, all ASCII | c      | {english_stem} | english_stem | {c}
 blank     | Space symbols   |        | {}             |              |
 asciiword | Word, all ASCII | hello  | {english_stem} | english_stem | {hello}
 blank     | Space symbols   |        | {}             |              |
 asciiword | Word, all ASCII | i      | {english_stem} | english_stem | {}
 blank     | Space symbols   |        | {}             |              |
 asciiword | Word, all ASCII | am     | {english_stem} | english_stem | {}
 blank     | Space symbols   |        | {}             |              |
 asciiword | Word, all ASCII | digoal | {english_stem} | english_stem | {digoal}
(13 rows)
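Note how the stop words “a”, “i”, and “am” map to an empty lexeme array. As a side note (this filter is our own, not part of the original recipe), such stop-word candidates can be listed directly:

select token from ts_debug('a b c hello i am digoal') where lexemes = '{}';

With the tokens in hand, the IDF of every token can be computed across the corpus: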
with t1 as (
  select count(*) as cnt from doc
),
t2 as (
  select id, alias, token from
  (
    select id, (ts_debug(info)).* from doc
  ) t
  group by id, alias, token
)
select t2.token, t2.alias, log(t1.cnt::numeric / count(t2.*)) as idf  -- cast to numeric to avoid integer division
from t1, t2
group by t2.token, t2.alias, t1.cnt;


 token  |   alias   |        idf
--------+-----------+-------------------
        | blank     |                 0
 hi     | asciiword |                 0
 abc    | asciiword | 0.301029995663981
 am     | asciiword |                 0
 i      | asciiword |                 0
 digoal | asciiword | 0.301029995663981
(6 rows)

Use PostgreSQL to Extract Keywords

The following steps describe how the keywords of each document are extracted:

set default_text_search_config='pg_catalog.english';

-- number of lexemes in each document (the TF denominator)
select id, length(to_tsvector(info)) as cnt from doc;

-- per-document occurrence counts for each word (the TF numerator)
select id, (ts_stat('select to_tsvector(info) from doc where id='||id)).* from doc;

To rank documents by how well they match a query, use ts_rank:
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
Ranks vectors based on the frequency of their matching lexemes.
The normalization option specifies whether and how a document's length should impact its rank:

0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1
The normalization logic can be seen in the PostgreSQL source, src/backend/utils/adt/tsrank.c:

Datum
ts_rank_ttf(PG_FUNCTION_ARGS)
{
    TSVector    txt = PG_GETARG_TSVECTOR(0);
    TSQuery     query = PG_GETARG_TSQUERY(1);
    int         method = PG_GETARG_INT32(2);
    float       res;

    res = calc_rank(getWeights(NULL), txt, query, method);

    PG_FREE_IF_COPY(txt, 0);
    PG_FREE_IF_COPY(query, 1);
    PG_RETURN_FLOAT4(res);
}

Datum
ts_rank_tt(PG_FUNCTION_ARGS)
{
    TSVector    txt = PG_GETARG_TSVECTOR(0);
    TSQuery     query = PG_GETARG_TSQUERY(1);
    float       res;

    res = calc_rank(getWeights(NULL), txt, query, DEF_NORM_METHOD);

    PG_FREE_IF_COPY(txt, 0);
    PG_FREE_IF_COPY(query, 1);
    PG_RETURN_FLOAT4(res);
}
static float4
calc_rank_cd(const float4 *arrdata, TSVector txt, TSQuery query, int method)
{
    DocRepresentation *doc;
    int         len,
                i,
                doclen = 0;
    CoverExt    ext;
    double      Wdoc = 0.0;
    double      invws[lengthof(weights)];
    double      SumDist = 0.0,
                PrevExtPos = 0.0,
                CurExtPos = 0.0;
    int         NExtent = 0;
    QueryRepresentation qr;

    for (i = 0; i < lengthof(weights); i++)
    {
        invws[i] = ((double) ((arrdata[i] >= 0) ? arrdata[i] : weights[i]));
        if (invws[i] > 1.0)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("weight out of range")));
        invws[i] = 1.0 / invws[i];
    }

    qr.query = query;
    qr.operandData = (QueryRepresentationOperand *)
        palloc0(sizeof(QueryRepresentationOperand) * query->size);

    doc = get_docrep(txt, &qr, &doclen);
    if (!doc)
    {
        pfree(qr.operandData);
        return 0.0;
    }

    MemSet(&ext, 0, sizeof(CoverExt));
    while (Cover(doc, doclen, &qr, &ext))
    {
        double      Cpos = 0.0;
        double      InvSum = 0.0;
        int         nNoise;
        DocRepresentation *ptr = ext.begin;

        while (ptr <= ext.end)
        {
            InvSum += invws[WEP_GETWEIGHT(ptr->pos)];
            ptr++;
        }

        Cpos = ((double) (ext.end - ext.begin + 1)) / InvSum;

        /*
         * if doc are big enough then ext.q may be equal to ext.p due to limit
         * of positional information. In this case we approximate number of
         * noise word as half cover's length
         */
        nNoise = (ext.q - ext.p) - (ext.end - ext.begin);
        if (nNoise < 0)
            nNoise = (ext.end - ext.begin) / 2;
        Wdoc += Cpos / ((double) (1 + nNoise));

        CurExtPos = ((double) (ext.q + ext.p)) / 2.0;
        if (NExtent > 0 && CurExtPos > PrevExtPos)  /* prevent division by
                                                     * zero in a case of
                                                     * multiple lexize */
            SumDist += 1.0 / (CurExtPos - PrevExtPos);

        PrevExtPos = CurExtPos;
        NExtent++;
    }

    if ((method & RANK_NORM_LOGLENGTH) && txt->size > 0)
        Wdoc /= log((double) (cnt_length(txt) + 1));

    if (method & RANK_NORM_LENGTH)
    {
        len = cnt_length(txt);
        if (len > 0)
            Wdoc /= (double) len;
    }

    if ((method & RANK_NORM_EXTDIST) && NExtent > 0 && SumDist > 0)
        Wdoc /= ((double) NExtent) / SumDist;

    if ((method & RANK_NORM_UNIQ) && txt->size > 0)
        Wdoc /= (double) (txt->size);

    if ((method & RANK_NORM_LOGUNIQ) && txt->size > 0)
        Wdoc /= log((double) (txt->size + 1)) / log(2.0);

    if (method & RANK_NORM_RDIVRPLUS1)
        Wdoc /= (Wdoc + 1);

    pfree(doc);
    pfree(qr.operandData);

    return (float4) Wdoc;
}
Recompute the IDFs as before:

with t1 as (
  select count(*) as cnt from doc
),
t2 as (
  select id, alias, token from
  (
    select id, (ts_debug(info)).* from doc
  ) t
  group by id, alias, token
)
select t2.token, t2.alias, log(t1.cnt::numeric / count(t2.*)) as idf
from t1, t2
group by t2.token, t2.alias, t1.cnt;

Finally, multiply each word's TF by its IDF to obtain its TF-IDF score; the words with the highest scores are the document's keywords.
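Putting the pieces together, here is a combined sketch (our own illustration, with the CTE names toks, tf, and idf chosen for clarity): compute each document's TF from the ts_debug tokens, join it to the IDF computed above, and rank by tf * idf:

with cnt as (
  select count(*) as n from doc
),
toks as (
  select id, token from (select id, (ts_debug(info)).* from doc) t where alias = 'asciiword'
),
tf as (
  select id, token, count(*)::numeric / sum(count(*)) over (partition by id) as tf
  from toks
  group by id, token
),
idf as (
  select token, log(cnt.n::numeric / count(distinct id)) as idf
  from toks, cnt
  group by token, cnt.n
)
select tf.id, tf.token, tf.tf * idf.idf as tfidf
from tf join idf using (token)
order by tf.id, tfidf desc;

The words with the highest tfidf value in each document are its keyword candidates.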
