Keyword Analysis with PostgreSQL: Cosine and Linear Correlation Algorithms for Text Analysis

Image for post
Image for post

Background

Usually, websites have tag functionality and automatically generate tags based on content on a web page. In simple words, tags are keywords on a web page, for example, a web page to sell mobile phones has tags related to mobile phone categories and features. Now, you must be wondering how are tags generated?

Before we answer this question, first, let’s answer some frequently asked questions related to keywords.

Is it true to state that if a word occurs more frequently in a document, it is more likely to be a keyword?

The answer is no. For example, words like “of”, “be”, “I”, and “you” frequently occur in a document. However, they are apparently not keywords but stop words that are negligible.

Also, it is important to understand that there are some words that are neither stop words nor keywords. For example, words such as “application”, “science”, and “Mandarin” may also frequently occur in documents, but they may not be keywords.

Now, the other critical question is whether there are any algorithms that can extract keywords from a document? This article describes the application of PostgreSQL along with Cosine and Linear Correlation Algorithms for text (keyword) analysis.

Term Frequency-Inverse Document Frequency Algorithm

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical method for evaluating the importance of a word or term for a document in a collection or corpus. The importance of a word or term increases in a document but decreases in the corpus when its occurrence frequency is high.

Search engines often use TF-IDF weights to measure or grade the correlation between documents and user queries. In addition to TF-IDF, search engines on the Internet also use grading based on link analysis to determine the occurrence sequence of documents in search results.

  • TF is easy to understand, that is the frequency of a word in a document.
  • IDF is a weight for measuring the universal importance of a word. A commoner word has a lower IDF.
  • To obtain the IDF of a word, divide the number of documents that contain the word by the total number of documents to obtain the logarithm.
  • To obtain the importance of a word in a document, multiply TF and IDF. The TF-IDF algorithm filters out common words and retains important words.

Examples

Example 1

Many mathematical formulas are used to calculate TF-IDF value. This example uses the following mathematical formulas.

1. TF = Number of occurrences of a word in a document/Total number of words in the document

For example, if a document has 100 words in total and the word “cow” occurs three times, the TF of “cow” is 0.03, that is, 3/100.

2. IDF = Total number of documents contained in a collection/Number of documents that contain “cow”

For example, if “cow” occurs in 1,000 of 10,000,000 documents, the IDF of “cow” is 4, that is, log(10,000,000/1,000).

3. Then, the TF-IDF of “cow” is 0.12, that is, 0.03 x 4.

A larger TF-IDF value of a word indicates that the word is more important in a document.

Example 2

On a web page with 1,000 words, the word “atomic energy”, “of”, and “application” occur 2, 35, and 5 times, respectively. Then, their TF values are 0.002, 0.035, and 0.005, respectively.

The sum of the three values (0.042) is a simple measurement for the correlation between the web page and the query for the “application of atomic energy”.

In short, if a query contains keywords w1, w2, … and wN and their TF values on a specific web page are TF1, TF2, …, and TFN, the correlation between the query and the web page is TF1 + TF2 + … + TFN.

However, you may find a bug here. In the preceding example, the TF of “of” occupies 80% of the total TF value. However, it is almost useless for determining the topic of the web page. Such words are called stop words. In other words, when measuring the correlation, you should not consider their frequencies. In Chinese, “be”, “and”, and “in” are also stop words.

Once you move ahead of these stop words, the correlation between the preceding web page and the query becomes 0.007, in which case “atomic energy” occupies 0.002 and “application” occupies 0.005.

Also, you may find another bug. In Chinese, “application” is a common word while “atomic energy” is a professional term. The correlation of “atomic energy” is more important than that of “application”. Therefore, we need to provide a weight (IDF) for each Chinese word. The weight must meet the following conditions:

1. A word with stronger topic prediction capability has a higher weight. When we see “atomic energy” on a web page, we can relate to the web page topic, more or less. However, when we see “application”, we basically have no idea about the topic. Therefore, the weight of “atomic energy” should be higher than that of “application”.

2. The weight of stop words should be zero. If a keyword seldom occurs on web pages and we can use it to easily locate the search objective, its weight should be high. If a keyword occurs on a large number of web pages and we do not know the search objective based on it, its weight should be less. In short, if a keyword w occurs on Dw web pages, a larger Dw value indicates a lower weight of w. During information retrieval, the commonest weight is the IDF, which is calculated by using log(D/Dw). Wherein, D indicates the total number of web pages, and Dw indicates the number of web pages that contain a specific word.

For example, if the number of web pages is 1 billion and the stop word “of” occurs on all web pages, then the Dw value is 1 billion and the IDF value is 0, that is, log(1 billion/1 billion).

If the professional term “atomic energy” occurs on 2 million web pages, then, the Dw value is 2 million and the IDF value is 2.7, that is, log(500). If the common word “application” occurs on 500 million web pages, then the IDF value is 0.3, that is, log(2).

The preceding example shows that one “atomic energy” matching result is equal to nine “application” matching results on web pages. After the IDF value is used, the preceding correlation calculation formula changes from the sum of TF values to the sum of weights, that is,

TF1 x IDF1 + TF2 x IDF2 + …, + TFN x IDFN

In the preceding example, the correlation between the web page and the query of “application of atomic energy” is 0.0069, wherein “atomic energy” occupies 0.0054 and “application” occupies 0.0015. These ratios turned out to be consistent with our intuition.

Lexicon

Lexicon and IDF

Lexicons are used in input methods. For example, if you are involved in the IT industry, you may use multiple IT industry terms to communicate with your colleagues. Phrases in a lexicon can be used for word-breaking. The IDF of a word in lexicons for different industries should be different because their collections are different. For example, there are 10 billion documents for the IT industry and 1 billion documents for the medical industry. Will the IDF of “biology” in lexicons for both industries be the same?

IDF Calculation

Similar to the census, you can perform sampling or network-wide computing to calculate the IDF value. Since search engines browse content on various websites every day, their content may be humungous. The IDF value calculated on the basis of the search engine content is accurate. However, the industry-specific IDF value needs to be calculated to extract accurate keywords in the respective verticals.

MADlib TF Measurement API

MADlib is an open-source machine learning library developed by Pivotal and Berkeley University. It can be used in the Greenplum, PostgreSQL, HAWQ, and other databases.

MADlib provides a TF measurement API.

The content is as follows. You can call the term_frequency function to collect statistics on the tables that store documents and obtain the TFs of all words in each record.

Term frequency tf(t,d) is to the raw frequency of a word/term in a document, i.e. the number of times that word/term t occurs in document d.

For this function, ‘word’ and ‘term’ are used interchangeably.

Note: the term frequency is not normalized by the document length.

The following describes the parameters.

PostgreSQL Full-Text Statistics Collection

PostgreSQL supports full-text retrieval (tsvector). You can use the ts_stat function to calculate the total number of occurrences of a word and the number of documents that contain the word in a table.

The ts_stat function is used to extract keywords in a table or check word-breaking validity. If a document contains a large number of stop words, the word-breaking effect is poor.

You can use the ts_stat function as follows.

Check the following example to understand this aspect better.

Refer the example below.

How to Train and Generate IDFs in a Database

Calculate the IDFs of all words, including stop words. Create a test table. Each table record contains a PK and a document.

Use the corresponding word-breaking configuration (ts_config) to break words for each document and calculate the IDFs of words in the table. A larger number of records indicates a more accurate IDF (similar to document training).

Also, use the ts_debug function to obtain the token type of each word.

The following table lists the token types contained in the default parser.

Image for post
Image for post
Image for post
Image for post

Now, calculate the IDF value.

Use PostgreSQL to Extract Keywords

The following steps describe how the keywords of each document are extracted:

Step 1. To calculate the TF value, count the number of words in each record. Assume that one record maps one document.

Count the number of times each word occurs in each document.

There is another method to calculate the TF value.

Refer to the following source code.

Step 2. Calculate the IDF value as shown below.

Step 3. Now calculate the TF-IDF of each word.

Step 4. Lastly, write the preceding logic to a function to extract the top N keywords of a document based on the (TF x IDF) value.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store