Keyword Analysis with PostgreSQL: Cosine and Linear Correlation Algorithms for Text Analysis

Background

Usually, websites have tag functionality and automatically generate tags based on content on a web page. In simple words, tags are keywords on a web page, for example, a web page to sell mobile phones has tags related to mobile phone categories and features. Now, you must be wondering how are tags generated?

Before we answer this question, first, let’s answer some frequently asked questions related to keywords.

Is it true to state that if a word occurs more frequently in a document, it is more likely to be a keyword?

The answer is no. For example, words like “of”, “be”, “I”, and “you” frequently occur in a document. However, they are apparently not keywords but stop words that are negligible.

Also, it is important to understand that there are some words that are neither stop words nor keywords. For example, words such as “application”, “science”, and “Mandarin” may also frequently occur in documents, but they may not be keywords.

Now, the other critical question is whether there are any algorithms that can extract keywords from a document? This article describes the application of PostgreSQL along with Cosine and Linear Correlation Algorithms for text (keyword) analysis.

Term Frequency-Inverse Document Frequency Algorithm

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical method for evaluating the importance of a word or term for a document in a collection or corpus. The importance of a word or term increases in a document but decreases in the corpus when its occurrence frequency is high.

Search engines often use TF-IDF weights to measure or grade the correlation between documents and user queries. In addition to TF-IDF, search engines on the Internet also use grading based on link analysis to determine the occurrence sequence of documents in search results.

  • TF is easy to understand, that is the frequency of a word in a document.

Examples

Example 1

Many mathematical formulas are used to calculate TF-IDF value. This example uses the following mathematical formulas.

1. TF = Number of occurrences of a word in a document/Total number of words in the document

For example, if a document has 100 words in total and the word “cow” occurs three times, the TF of “cow” is 0.03, that is, 3/100.

2. IDF = Total number of documents contained in a collection/Number of documents that contain “cow”

For example, if “cow” occurs in 1,000 of 10,000,000 documents, the IDF of “cow” is 4, that is, log(10,000,000/1,000).

3. Then, the TF-IDF of “cow” is 0.12, that is, 0.03 x 4.

A larger TF-IDF value of a word indicates that the word is more important in a document.

Example 2

On a web page with 1,000 words, the word “atomic energy”, “of”, and “application” occur 2, 35, and 5 times, respectively. Then, their TF values are 0.002, 0.035, and 0.005, respectively.

The sum of the three values (0.042) is a simple measurement for the correlation between the web page and the query for the “application of atomic energy”.

In short, if a query contains keywords w1, w2, … and wN and their TF values on a specific web page are TF1, TF2, …, and TFN, the correlation between the query and the web page is TF1 + TF2 + … + TFN.

However, you may find a bug here. In the preceding example, the TF of “of” occupies 80% of the total TF value. However, it is almost useless for determining the topic of the web page. Such words are called stop words. In other words, when measuring the correlation, you should not consider their frequencies. In Chinese, “be”, “and”, and “in” are also stop words.

Once you move ahead of these stop words, the correlation between the preceding web page and the query becomes 0.007, in which case “atomic energy” occupies 0.002 and “application” occupies 0.005.

Also, you may find another bug. In Chinese, “application” is a common word while “atomic energy” is a professional term. The correlation of “atomic energy” is more important than that of “application”. Therefore, we need to provide a weight (IDF) for each Chinese word. The weight must meet the following conditions:

1. A word with stronger topic prediction capability has a higher weight. When we see “atomic energy” on a web page, we can relate to the web page topic, more or less. However, when we see “application”, we basically have no idea about the topic. Therefore, the weight of “atomic energy” should be higher than that of “application”.

2. The weight of stop words should be zero. If a keyword seldom occurs on web pages and we can use it to easily locate the search objective, its weight should be high. If a keyword occurs on a large number of web pages and we do not know the search objective based on it, its weight should be less. In short, if a keyword w occurs on Dw web pages, a larger Dw value indicates a lower weight of w. During information retrieval, the commonest weight is the IDF, which is calculated by using log(D/Dw). Wherein, D indicates the total number of web pages, and Dw indicates the number of web pages that contain a specific word.

For example, if the number of web pages is 1 billion and the stop word “of” occurs on all web pages, then the Dw value is 1 billion and the IDF value is 0, that is, log(1 billion/1 billion).

If the professional term “atomic energy” occurs on 2 million web pages, then, the Dw value is 2 million and the IDF value is 2.7, that is, log(500). If the common word “application” occurs on 500 million web pages, then the IDF value is 0.3, that is, log(2).

The preceding example shows that one “atomic energy” matching result is equal to nine “application” matching results on web pages. After the IDF value is used, the preceding correlation calculation formula changes from the sum of TF values to the sum of weights, that is,

TF1 x IDF1 + TF2 x IDF2 + …, + TFN x IDFN

In the preceding example, the correlation between the web page and the query of “application of atomic energy” is 0.0069, wherein “atomic energy” occupies 0.0054 and “application” occupies 0.0015. These ratios turned out to be consistent with our intuition.

Lexicon

Lexicon and IDF

Lexicons are used in input methods. For example, if you are involved in the IT industry, you may use multiple IT industry terms to communicate with your colleagues. Phrases in a lexicon can be used for word-breaking. The IDF of a word in lexicons for different industries should be different because their collections are different. For example, there are 10 billion documents for the IT industry and 1 billion documents for the medical industry. Will the IDF of “biology” in lexicons for both industries be the same?

IDF Calculation

Similar to the census, you can perform sampling or network-wide computing to calculate the IDF value. Since search engines browse content on various websites every day, their content may be humungous. The IDF value calculated on the basis of the search engine content is accurate. However, the industry-specific IDF value needs to be calculated to extract accurate keywords in the respective verticals.

MADlib TF Measurement API

MADlib is an open-source machine learning library developed by Pivotal and Berkeley University. It can be used in the Greenplum, PostgreSQL, HAWQ, and other databases.

MADlib provides a TF measurement API.

The content is as follows. You can call the term_frequency function to collect statistics on the tables that store documents and obtain the TFs of all words in each record.

Term frequency tf(t,d) is to the raw frequency of a word/term in a document, i.e. the number of times that word/term t occurs in document d.

For this function, ‘word’ and ‘term’ are used interchangeably.

Note: the term frequency is not normalized by the document length.

term_frequency(input_table,    
doc_id_col,
word_col,
output_table,
compute_vocab)

The following describes the parameters.

Arguments:    

input_table
TEXT.
The name of the table storing the documents.
Each row is in the form <doc_id, word_vector> where doc_id is an id, unique to each document, and word_vector is a text array containing the words in the document.
The word_vector should contain multiple entries of a word if the document contains multiple occurrence of that word.

id_col
TEXT.
The name of the column containing the document id.

word_col
TEXT.
The name of the column containing the vector of words/terms in the document.
This column should of type that can be cast to TEXT[], 例如tsvector.

output_table
TEXT.
The name of the table to store the term frequency output.
The output table contains the following columns:
id_col:
This the document id column (same as the one provided as input).
word:
A word/term present in a document. This is either the original word present in word_col or an id representing the word (depending on the value of compute_vocab below).
count:
The number of times this word is found in the document.
compute_vocab:
BOOLEAN. (Optional, Default=FALSE) Flag to indicate if a vocabulary is to be created.
If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word.
The table is called output_table_vocabulary (suffix added to the output_table name) and contains the following columns:
wordid:
An id assignment for each word
word:
The word/term

PostgreSQL Full-Text Statistics Collection

PostgreSQL supports full-text retrieval (tsvector). You can use the ts_stat function to calculate the total number of occurrences of a word and the number of documents that contain the word in a table.

The ts_stat function is used to extract keywords in a table or check word-breaking validity. If a document contains a large number of stop words, the word-breaking effect is poor.

You can use the ts_stat function as follows.

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.    

ts_stat(sqlquery text, [ weights text, ]
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record

sqlquery is a text value containing an SQL query which must return a single tsvector column.

ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data.

The columns returned are

word text — the value of a lexeme

ndoc integer — number of documents (tsvectors) the word occurred in

nentry integer — total number of occurrences of the word

If weights is supplied, only occurrences having one of those weights are counted.

For example, to find the ten most frequent words in a document collection:

Check the following example to understand this aspect better.

SELECT * FROM ts_stat('SELECT vector FROM apod')    
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

Refer the example below.

postgres=# SELECT * from (values (tsvector 'a b c'),(to_tsvector( 'b c d b'))) t(word);    
word
---------------------
'a' 'b' 'c'
'b':1,4 'c':2 'd':3
(2 rows)

postgres=# SELECT * FROM ts_stat($$SELECT * from (values (tsvector 'a b c'),(to_tsvector( 'b c d b'))) t(word)$$);
word | ndoc | nentry
------+------+--------
d | 1 | 1
c | 2 | 2
b | 2 | 3
a | 1 | 1
(4 rows)

How to Train and Generate IDFs in a Database

Calculate the IDFs of all words, including stop words. Create a test table. Each table record contains a PK and a document.

create table doc(id int primary key, info text);    

postgres=# insert into doc values (1,'hi i am digoal');
INSERT 0 1

postgres=# insert into doc values (2,'hi i am abc');
INSERT 0 1

Use the corresponding word-breaking configuration (ts_config) to break words for each document and calculate the IDFs of words in the table. A larger number of records indicates a more accurate IDF (similar to document training).

Also, use the ts_debug function to obtain the token type of each word.

postgres=# select * from  ts_debug('a b c hello i am digoal');    
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+----------------+--------------+----------
asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | b | {english_stem} | english_stem | {b}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | c | {english_stem} | english_stem | {c}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | i | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | am | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | digoal | {english_stem} | english_stem | {digoal}
(13 rows)

The following table lists the token types contained in the default parser.

Now, calculate the IDF value.

with t1 as (    
select count(*) as cnt from doc
),
t2 as (
select id, alias, token from
(
select id,(ts_debug(info)).* from doc
) t
group by id, alias, token
)
select t2.token, t2.alias, log(t1.cnt/count(t2.*)) as idf from t1,t2 group by t2.token,t2.alias,t1.cnt;


token | alias | idf
--------+-----------+-------------------
| blank | 0
hi | asciiword | 0
abc | asciiword | 0.301029995663981
am | asciiword | 0
i | asciiword | 0
digoal | asciiword | 0.301029995663981
(6 rows)

Use PostgreSQL to Extract Keywords

The following steps describe how the keywords of each document are extracted:

Step 1. To calculate the TF value, count the number of words in each record. Assume that one record maps one document.

set default_text_search_config='pg_catalog.english';  

select id, length(to_tsvector(info)) as cnt from doc;

Count the number of times each word occurs in each document.

select id, (ts_stat('select to_tsvector(info) from doc where id='||id)).* from doc;

There is another method to calculate the TF value.

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
Ranks vectors based on the frequency of their matching lexemes.
normalization
0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length2 divides the rank by the document length4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)8 divides the rank by the number of unique words in document16 divides the rank by 1 + the logarithm of the number of unique words in document32 divides the rank by itself + 1

Refer to the following source code.

src/backend/utils/adt/tsrank.cDatum
ts_rank_ttf(PG_FUNCTION_ARGS)
{
TSVector txt = PG_GETARG_TSVECTOR(0);
TSQuery query = PG_GETARG_TSQUERY(1);
int method = PG_GETARG_INT32(2);
float res;
res = calc_rank(getWeights(NULL), txt, query, method); PG_FREE_IF_COPY(txt, 0);
PG_FREE_IF_COPY(query, 1);
PG_RETURN_FLOAT4(res);
}
Datum
ts_rank_tt(PG_FUNCTION_ARGS)
{
TSVector txt = PG_GETARG_TSVECTOR(0);
TSQuery query = PG_GETARG_TSQUERY(1);
float res;
res = calc_rank(getWeights(NULL), txt, query, DEF_NORM_METHOD); PG_FREE_IF_COPY(txt, 0);
PG_FREE_IF_COPY(query, 1);
PG_RETURN_FLOAT4(res);
}
static float4
calc_rank_cd(const float4 *arrdata, TSVector txt, TSQuery query, int method)
{
DocRepresentation *doc;
int len,
i,
doclen = 0;
CoverExt ext;
double Wdoc = 0.0;
double invws[lengthof(weights)];
double SumDist = 0.0,
PrevExtPos = 0.0,
CurExtPos = 0.0;
int NExtent = 0;
QueryRepresentation qr;
for (i = 0; i < lengthof(weights); i++)
{
invws[i] = ((double) ((arrdata[i] >= 0) ? arrdata[i] : weights[i]));
if (invws[i] > 1.0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("weight out of range")));
invws[i] = 1.0 / invws[i];
}
qr.query = query;
qr.operandData = (QueryRepresentationOperand *)
palloc0(sizeof(QueryRepresentationOperand) * query->size);
doc = get_docrep(txt, &qr, &doclen);
if (!doc)
{
pfree(qr.operandData);
return 0.0;
}
MemSet(&ext, 0, sizeof(CoverExt));
while (Cover(doc, doclen, &qr, &ext))
{
double Cpos = 0.0;
double InvSum = 0.0;
int nNoise;
DocRepresentation *ptr = ext.begin;
while (ptr <= ext.end)
{
InvSum += invws[WEP_GETWEIGHT(ptr->pos)];
ptr++;
}
Cpos = ((double) (ext.end - ext.begin + 1)) / InvSum; /*
* if doc are big enough then ext.q may be equal to ext.p due to limit
* of posional information. In this case we approximate number of
* noise word as half cover's length
*/
nNoise = (ext.q - ext.p) - (ext.end - ext.begin);
if (nNoise < 0)
nNoise = (ext.end - ext.begin) / 2;
Wdoc += Cpos / ((double) (1 + nNoise));
CurExtPos = ((double) (ext.q + ext.p)) / 2.0;
if (NExtent > 0 && CurExtPos > PrevExtPos /* prevent devision by
* zero in a case of
multiple lexize */ )
SumDist += 1.0 / (CurExtPos - PrevExtPos);
PrevExtPos = CurExtPos;
NExtent++;
}
if ((method & RANK_NORM_LOGLENGTH) && txt->size > 0)
Wdoc /= log((double) (cnt_length(txt) + 1));
if (method & RANK_NORM_LENGTH)
{
len = cnt_length(txt);
if (len > 0)
Wdoc /= (double) len;
}
if ((method & RANK_NORM_EXTDIST) && NExtent > 0 && SumDist > 0)
Wdoc /= ((double) NExtent) / SumDist;
if ((method & RANK_NORM_UNIQ) && txt->size > 0)
Wdoc /= (double) (txt->size);
if ((method & RANK_NORM_LOGUNIQ) && txt->size > 0)
Wdoc /= log((double) (txt->size + 1)) / log(2.0);
if (method & RANK_NORM_RDIVRPLUS1)
Wdoc /= (Wdoc + 1);
pfree(doc); pfree(qr.operandData); return (float4) Wdoc;
}

Step 2. Calculate the IDF value as shown below.

with t1 as (    
select count(*) as cnt from doc
),
t2 as (
select id, alias, token from
(
select id,(ts_debug(info)).* from doc
) t
group by id, alias, token
)
select t2.token, t2.alias, log(t1.cnt/count(t2.*)) as idf from t1,t2 group by t2.token,t2.alias,t1.cnt;

Step 3. Now calculate the TF-IDF of each word.

tf * idf

Step 4. Lastly, write the preceding logic to a function to extract the top N keywords of a document based on the (TF x IDF) value.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.