Keyword Analysis with PostgreSQL: Algorithms for Text Analysis with the Smlar Plug-in


Background

Let us start with two examples to understand how rank calculation works in PostgreSQL.

Example 1

The following two records are available in the database.
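For illustration, assume the table and records look something like this (the table name, columns, and text are hypothetical stand-ins; the original records are not reproduced here):

```sql
CREATE TABLE doc (id int, info text);

-- Record 1 mentions "dog" in a generic context.
INSERT INTO doc VALUES (1, 'I walk my dog in the park every morning');

-- Record 2 is specifically about the Chihuahua breed.
INSERT INTO doc VALUES (2, 'The chihuahua is a tiny dog breed from Mexico');
```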

Use the following query condition and run a query.
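A full-text query along the following lines (the exact query text is an assumption) matches both records:

```sql
SELECT id, info,
       ts_rank(to_tsvector('english', info),
               to_tsquery('english', 'dog | chihuahua')) AS rank
FROM doc
WHERE to_tsvector('english', info) @@ to_tsquery('english', 'dog | chihuahua');
```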

Clearly, both records match the query. The question is: which one should be displayed at the top?

Example 2

Now, search for “dog|chihuahua” in a search engine and see which result is displayed at the top. Skip to the second page to avoid advertisements.

As the search results show, pages about the Chihuahua are displayed first. You may ask: why?

This is due to the term frequency-inverse document frequency (TF-IDF) algorithm. Across a global corpus, the word 'dog' appears in far more texts than 'Chihuahua'. Therefore, according to the inverse document frequency (IDF) formula, IDF = log(total number of texts / number of texts containing the word), the IDF of 'dog' is lower than that of 'Chihuahua'.
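As a rough illustration with made-up numbers: suppose a corpus contains 1,000,000 texts, of which 100,000 contain 'dog' and only 1,000 contain 'Chihuahua'. Then IDF('dog') = log(1,000,000 / 100,000) = 1, while IDF('Chihuahua') = log(1,000,000 / 1,000) = 3, so with equal TF the TF × IDF score for 'Chihuahua' is three times higher.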

However, when TF is the same, the value of TF × IDF depends entirely on the IDF. PostgreSQL databases support full-text search, but do they support TF-IDF by default?

PostgreSQL Databases Do Not Use IDF to Calculate the Rank by Default

According to the code, the ts_rank and ts_rank_cd functions of PostgreSQL do not take IDF into consideration.

People across social forums are asking the same question. You can read more about this here.

PostgreSQL's built-in rank algorithms do not take IDF into account; they only calculate the word frequency within the current text.
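You can see this with a minimal sketch (the document strings are made up): both documents contain exactly one of the query terms once, so ts_rank assigns them the same score even though 'dog' is a far more common word than 'chihuahua'.

```sql
SELECT ts_rank(to_tsvector('english', 'my dog sleeps'),
               to_tsquery('english', 'dog | chihuahua')) AS rank_dog,
       ts_rank(to_tsvector('english', 'my chihuahua sleeps'),
               to_tsquery('english', 'dog | chihuahua')) AS rank_chihuahua;
-- Both expressions return the same value.
```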

Currently, PostgreSQL uses ts_rank and ts_rank_cd to calculate the correlation between tsquery and tsvector. For more details about the algorithm, refer to the previous blog in this series Keyword Analysis with PostgreSQL: Cosine and Linear Correlation Algorithms for Text Analysis.

Introduction to the smlar Plug-in

The smlar plug-in supports multiple similarity calculation formulas (algorithms), including cosine (default), TF-IDF, and overlap. It also allows you to customize a formula or function to calculate the similarity.

Function API

1. Calculate the similarity between arrays.
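For example, with the default cosine formula (the integer arrays are arbitrary):

```sql
-- 2 shared elements, 3 unique elements in each array:
-- cosine similarity = 2 / sqrt(3 * 3) ≈ 0.667
SELECT smlar('{1,2,3}'::int[], '{2,3,4}'::int[]);
```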

2. Calculate the similarity between custom composite arrays (elements and weights). You can choose to compute over all elements or only the overlapping elements. When the weights differ, the similarity result differs, for example, in the cosine algorithm.

Alternatively, you can treat the array as a weighted array that carries TF-IDF values, such as [('China', 0.1), ('Japan', 0.1), ('pirate', 0.9)]. This is useful for text similarity analysis.

Consider the following example:
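A minimal sketch, assuming a composite type with an element column and a float4 weight column (the type name and values are made up):

```sql
CREATE TYPE word_weight AS (word text, weight float4);

-- Third argument: false = compute over all elements,
--                 true  = use only the overlapping elements in the denominator.
SELECT smlar(
  '{"(china,0.1)","(japan,0.1)","(pirate,0.9)"}'::word_weight[],
  '{"(china,0.1)","(pirate,0.9)"}'::word_weight[],
  false
);
```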

Note: GiST doesn't support the composite weighted type.

3. Use a custom formula that contains three variables, namely, N.i, N.a, and N.b, to calculate the similarity between arrays. You can also customize the algorithm.
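For example (a sketch; N.i is the number of common elements, and N.a and N.b are the numbers of unique elements in the first and second arrays), the cosine formula can be written explicitly:

```sql
-- Equivalent to the built-in cosine formula: N.i / sqrt(N.a * N.b)
SELECT smlar('{1,2,3}'::int[], '{2,3,4}'::int[], 'N.i / sqrt(N.a * N.b)');
```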

Operator API

Determines whether two arrays are similar, that is, whether their similarity value is greater than the threshold specified by the smlar.threshold parameter.
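For example (the threshold value is chosen arbitrarily):

```sql
SET smlar.threshold = 0.6;

-- Cosine similarity is 2 / sqrt(9) ≈ 0.667 > 0.6, so this returns true.
SELECT '{1,2,3}'::int[] % '{2,3,4}'::int[];
```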

Conversion Function API

The conversion function API provides the following functions (a usage sketch follows the list):

1. Convert tsvector-type data to a text array.

2. Sort the elements in an array and remove duplicate elements.

3. Determine whether an array contains a specific element. If yes, 1.0 is returned; if not, 0 is returned.

4. Similar to a ternary operator: determine whether an array contains a specific element. If yes, the value of the third parameter is returned; if not, the value of the fourth parameter is returned.
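A usage sketch of these functions (the input values are arbitrary):

```sql
-- 1. Convert a tsvector to a text array.
SELECT tsvector2textarray(to_tsvector('english', 'the quick brown fox jumps over the lazy dog'));

-- 2. Sort the array and remove duplicate elements.
SELECT array_unique('{b,a,c,a}'::text[]);

-- 3. Returns 1.0 if the element is present, 0 otherwise.
SELECT inarray('{a,b,c}'::text[], 'b'::text);

-- 4. Returns the third argument if the element is present, the fourth otherwise.
SELECT inarray('{a,b,c}'::text[], 'z'::text, 1.0::float4, 0.1::float4);
```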

Parameter

Set the parameters as described in the following steps:

Step 1. Specify the similarity threshold. When the similarity is lower than the threshold, 'false' is returned by the % operator, which determines whether two arrays are similar.

Step 2. Specify whether to persist the cache of the IDF statistics (read from the IDF table) across transactions.

Step 3. Specify the default similarity algorithm.

As the source code shows, the IDF statistics table is used during the TF-IDF calculation.

For more information about the algorithms, refer to the document Similarity Algorithms — Effective Similarity Search in PostgreSQL.

Step 4. If the similarity algorithm is TF-IDF, set this parameter to the name of a statistics table that records the number of texts containing each word and the total number of texts.

Step 5. If the similarity algorithm is TF-IDF, set this parameter to choose the TF calculation method: the raw frequency of occurrence, 1 + log(frequency of occurrence), or the constant 1.

Note: The GIN index supports only smlar.tf_method = "const".

Step 6. If the similarity algorithm is TF-IDF, set this parameter to control whether 1 is added inside the logarithm when the IDF is calculated.

Recommended Parameter Configuration
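A typical configuration for TF-IDF similarity might look like the following (the values and the statistics table name doc_idf are illustrative, not mandatory):

```sql
SET smlar.type = 'tfidf';          -- similarity formula: cosine (default), tfidf, or overlap
SET smlar.stattable = 'doc_idf';   -- statistics table with per-word document counts and the total count
SET smlar.tf_method = 'const';     -- TF method: "n", "log" (1 + log(n)), or "const"; GIN supports only "const"
SET smlar.idf_plus_one = on;       -- compute IDF as log(1 + total/n) instead of log(total/n)
SET smlar.persistent_cache = on;   -- keep the IDF statistics cache across transactions
SET smlar.threshold = 0.5;         -- the % operator returns true only above this similarity
```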

Index OPS Class

The core function of the smlar plug-in is to calculate the similarity between arrays of any type. It also supports indexing to speed up queries. An operator class (OPS) is required when you create an index; operator classes are provided for the different array types, for example:
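The operator class names follow the pattern _<type>_sml_ops; the table and column names below are placeholders:

```sql
-- GIN index over a text array
CREATE INDEX idx_docs_tags_gin ON docs USING gin (tags _text_sml_ops);

-- GiST index over an integer array
CREATE INDEX idx_docs_ids_gist ON docs USING gist (ids _int4_sml_ops);
```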

How to Calculate Similarity in TF-IDF Mode by Using smlar: An Example

The following steps describe how to use TF-IDF.

Step 1. Install the smlar plug-in.
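Assuming the extension is already compiled and available on the server:

```sql
CREATE EXTENSION smlar;
```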

Step 2. Now, create a test table.
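The table name and columns below are placeholders used throughout the rest of this example:

```sql
CREATE TABLE t_tfidf (
  id   serial PRIMARY KEY,
  info text            -- the document content
);
```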

Step 3. Import test data. You can visit this site to generate test data.

Or execute the following commands:
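A sketch that generates purely synthetic test data from a small vocabulary (the word list and row counts are arbitrary):

```sql
INSERT INTO t_tfidf (info)
SELECT string_agg(w, ' ')
FROM (
  SELECT d.i,
         (ARRAY['dog','cat','chihuahua','pet','food','toy','walk','park'])
           [1 + floor(random() * 8)::int] AS w
  FROM generate_series(1, 10000) AS d(i),      -- 10,000 documents
       generate_series(1, 20)    AS word(n)    -- 20 words per document
) s
GROUP BY s.i;
```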

Step 4. Create a statistics table to record the number of texts that contain each word and the total number of texts. For daily use, you can export phrases and IDF values from a dictionary, derive the total number of texts and the per-word document counts, and import that information into the table.

For example, you can import this information from SCWS, Jieba, and other dictionaries to the following table.
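The statistics table follows the structure expected by smlar.stattable; the table name doc_idf is a placeholder:

```sql
CREATE TABLE doc_idf (
  value text UNIQUE,                      -- the word; a row with a NULL value stores the total document count
  ndoc  int4 NOT NULL CHECK (ndoc > 0)    -- number of documents containing the word
);
```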

Step 5. Use local data to generate the IDF statistics. You can skip this step in practice; it is recommended to import IDF values from a dictionary instead.
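A sketch that derives the per-word document counts from the local test data:

```sql
INSERT INTO doc_idf (value, ndoc)
SELECT word, count(*)
FROM (
  -- each lexeme appears only once per tsvector, so count(*) is the document count
  SELECT id,
         unnest(tsvector2textarray(to_tsvector('english', info))) AS word
  FROM t_tfidf
) t
GROUP BY word;
```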

Step 6. Now, insert the number of texts.
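Per the smlar convention, a row with a NULL value stores the total number of texts:

```sql
INSERT INTO doc_idf (value, ndoc)
SELECT NULL, count(*) FROM t_tfidf;
```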

Step 7. Create a function that converts the text to a text array.
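A sketch of such a function (the function name is a placeholder); it must be IMMUTABLE so it can be used in an expression index:

```sql
CREATE OR REPLACE FUNCTION record_to_textarray(text)
RETURNS text[] AS $$
  -- convert the text to a tsvector, then to a sorted, de-duplicated text array
  SELECT array_unique(tsvector2textarray(to_tsvector('english', $1)));
$$ LANGUAGE sql IMMUTABLE STRICT;
```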

Note: Duplicate values are lost when tsvector is converted to an array.

Step 8. Next, create an expression index.
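For example, a GIN expression index over the conversion function defined above:

```sql
CREATE INDEX idx_t_tfidf_info ON t_tfidf
  USING gin (record_to_textarray(info) _text_sml_ops);
```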

Step 9. Set parameters and use the TF-IDF formula to calculate the array similarity.
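For example (the threshold is chosen arbitrarily):

```sql
SET smlar.type = 'tfidf';
SET smlar.stattable = 'doc_idf';
SET smlar.tf_method = 'const';   -- required when using the GIN index
SET smlar.threshold = 0.5;
```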

Step 10. Lastly, query the records. When the TF-IDF similarity is equal to or greater than the value of smlar.threshold, records are returned.
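A sketch of such a query (the search terms are made up):

```sql
SELECT id, info,
       smlar(record_to_textarray(info), '{dog,chihuahua}'::text[]) AS sml
FROM t_tfidf
WHERE record_to_textarray(info) % '{dog,chihuahua}'::text[]   -- can use the GIN index
ORDER BY sml DESC
LIMIT 10;
```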

You can increase the similarity threshold to exclude more records and improve query efficiency.

IDF of MADlib Training Text

Basic steps for clustering documents may vary a bit depending on your specific goals. However, the following example shows a standard scenario:

  1. Often there is an initial first-level processing step to handle tokenization, word stemming, stop-word filtering, etc. There is no single right way to handle this step, and there is a fair amount of art to doing it well.
  2. Next, build feature vectors for each document. There are a variety of methods for handling this; one common example is to build tf-idfs. For this, you need the following metrics:
     - the word count for each document
     - the total document count
  3. Having produced a feature vector for your documents, you can then call the kmeans function in MADlib to perform the actual clustering.

Given a tf-idf feature vector, the most common distance function is cosine similarity, though other distance functions may make sense depending on your use case. RUM also supports TF-IDF.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.


Follow me to keep abreast of the latest technology news, industry insights, and developer trends.
