Keyword Analysis with PostgreSQL: Algorithms for Text Analysis with RUM and Smlar

Background

In our previous blogs, we have introduced the TF-IDF algorithm and its application in text analysis (keyword extraction) with PostgreSQL.

In addition to document keyword extraction, the TF-IDF algorithm is also used to compare the similarity between documents. For example, multiple pieces of similar news are provided under the main Google news by the search engine.

So, now you may ask, how do we find similar documents?

The following are some similarity scenarios.

  • Different articles with the same quotes.
  • Homework and exam paper with the same answers.
  • Different articles with the same theme, for example, spring outing.

You can use different methods to find the similarity in the preceding scenarios. Alternatively, different algorithms can be used to improve efficiency in such scenarios.

1. For similar content, including same words or terms in different documents, use full-text retrieval. For example, search for documents that contain the same quote. In PostgreSQL, use the tsvector, tsquery, GiST, GIN, or RUM index for efficient full-text retrieval.
2. For duplicate documents, search by using the B-tree or hash index of the document hash value.

3. For different documents with the same theme such as spring outing, which may contain the same things and scenes, we recommend that you use the relative TF. Relative TF is the number of occurrences of a word in a document divided by the total number of words in a document to prevent the difference caused by the document size.

To calculate the document similarity in the third scenario, you only need to calculate the keyword similarity. The following sections describe several similarity algorithms.

1. Linear Correlation

In 1801, Italian astronomer Giuseppe Piazzi discovered the first asteroid Ceres. After 40 days of tracking and observation, Giuseppe Piazzi lost the Ceres’s position because the Ceres run to the rear of the sun. Global scientists tried to find the Ceres based on observation data provided by Giuseppe Piazzi. However, Ceres was not found based on most people’s computing results. Gauss also calculated the Ceres’s track. Based on the track calculated by Gauss, astronomer Heinrich Wilhelm Olbers found Ceres.

This is an example of a linear regression analysis that uses the least-squares fitting technique. For more background information, visit the following webpages:

In addition to astronomy, linear regression analysis also applies to finance, weather, and disease forecast.

It is important to understand the relationship between linear regression and document similarity analysis. In my opinion, linear regression can be used to calculate the correlation between similar TF-IDF vectors and then to determine the similarity. A correlation value closer to 1 indicates more similar vectors.

Consider the following example.

Since here the keywords do not completely overlap, you can use either of the following correction methods.

1. Calculate the correlation between intersected keywords and use a weight for correction. For example, set a weight based on the ratio of intersection score or the number of elements.

The intersection linear correlation is calculated based on the linear correlation of the following groups of values, which is 1.

2. Obtain the intersection of keywords in these two documents and obtain the TF-IDF value of each word in corresponding documents to produce two groups of vectors. Then, calculate the correlation between both groups of vectors. Following example illustrates this:

The SQL statement for the same is as follows.

3. If you find it difficult to obtain the TF-IDF values of missing keywords in a union, use 0 to substitute them.

The simplest method is to obtain the intersection correlation and then use a weight. It is also simple to use an index to accelerate the speed.

2. Document Similarity Algorithm: TF-IDF Cosine Similarity of Top N Words

In addition to linear correlation, you can also use the cosine similarity algorithm to calculate document similarity.

In the cosine similarity algorithm, the TF-IDF vectors of keywords are extracted, and the intersection of two groups of vectors and the distance between TF-IDF vectors are calculated. Based on the ratio of the intersection value or the number of elements, weight is set to correct the similarity. For more background information, please refer to the following page http://en.wikipedia.org/wiki/Cosine_similarity

To simplify the operation, we start from sentences.

So, in this case, how can we calculate the similarity between both sentences?

The basic idea here is that two sentences that use similar words are similar. Therefore, calculate their similarity based on the TF value.

Step 1: Separate words.

Step 2: List all words.

Step 3: Calculate the TF value.

Step 4: Put down the TF vectors based on the keywords.

Here, the similarity between vectors is calculated. If vector a is [x1, y1] and vector B is [x2, y2], you can change the cosine formula to the following form:

Mathematicians have proved that this cosine calculation method also applies to n-dimensional vectors. If A and B are both n-dimensional vectors and A is [A1, A2, …, An] and B is [B1, B2, …, Bn], the cosine value of angle (θ) between A and B is calculated as follows.

You can use this formula to obtain the cosine value of the angle between sentences A and B.

A cosine value closer to 1 indicates a closer angle to 0° and more similar vectors. That is the cosine similarity. The preceding sentences A and B are similar, and their angle is about 20.3°.

Therefore, we can obtain an algorithm to find out similar documents.

Cosine similarity is a useful algorithm for calculating the similarity between vectors.

TF-IDF Algorithm Application for Document Similarity in Databases

1. The smlar Plug-in

We’ve discussed how to use the smlar plug in for similarity analysis in the previous blog.

Its principle is similar to that of the cosine similarity algorithm. It also allows users to customize the similarity formula.

The smlar plug-in only supports the GIN and GiST indexes for built-in data arrays. This implies that it cannot retrieve custom arrays by using indexes.

Consider the following example.

Summary

Currently, the smlar plug-in is more suitable for the similarity retrieval of non-compound arrays (built-in arrays), such as the similarity retrieval of data with fixed attributes. For example, when 1,000 attributes are abstracted, each row represents an object and stores the score of each attribute of the object.

For example, find similar user groups based on the tag vector similarity.

Another example could be to find the similarity between images of the same size by comparing on the basis of pixel RGB value vectors.

The smlar plug-in also supports the similarity comparison of other types of built-in arrays. However, elements do not have weights which imply that the TF-IDF calculation is temporarily unavailable.

Note: If we expand the smlar plug-in to enable its compound arrays to support the GIN and GiST indexes, its applicable scope will be wider.

During application, extract keywords and calculate the TF-IDF as the compound vector arrays and store array content to the database. Calculate the similarity between two groups of vectors in the database. This feature is powerful because it can be accelerated by using an index.

2. The RUM Plug-in

Calculating document similarity based on keywords is a simple method with low accuracy. In addition to keyword-based similarity comparison, the PostgreSQL RUM plug-in supports the similarity comparison of an entire document.

The method for comparing the similarity of an entire document is quite common and will not be described in this article..

The document similarity comparison is more accurate. However, compared with keyword-based similarity comparison, document similarity comparison needs to store more words and consume more storage space, resulting in decreased efficiency.

The following describes how to use the RUM plug-in and keywords to calculate the similarity.

Tsvector does not store the TF-IDF value, it stores the TF value instead.

The following shows the table structure. (It is assumed that you have stored data in the preceding format to the info field of the tbl_test table.)

Now, query the similarity (rank) between documents using the RUM plug-in. The input is tsquery, which is spliced by “|”.

Note: Because tsquery does not have weight information, you cannot enter the weights or TFs. Currently, the RUM plug-in does not calculate ts_rank based on the cosine similarity algorithm.

Although tsquery does not support weights, you can enter duplicate values to increase the weight of a word.

Increase the weight of “place”.

RUM also supports different rank algorithms. The following masks can be summed as one integer, ranging from 1 to 32.

For more information about algorithm differences after the masks are added, see the rum/src/rum_ts_utils.c code.

The following shows a query based on a custom method.

You can also use the traditional PostgreSQL for ranking. However, traditional ranking does not support indexing.

It is complex to enable the RUM plug-in to support the usage of the keywords, which has been mentioned in the TODO file of the RUM plug-in. TF-IDF will be introduced to simplify the operation.

Performance test (10 million test data records with 33 keywords in each record).

Summary

Both the smlar and RUM plug-ins have defects in querying the similarity between keyword vectors. The smlar plug-in is more suitable for querying the similarity between arrays with fixed attribute values. The RUM algorithm is not a cosine vectorization algorithm though it supports similarity query.

As a personal opinion, I prefer the smlar plug-in. You can store keywords in arrays as follows without adding the TF-IDF weight.

A larger threshold indicates a higher similarity, stricter requirements, fewer matched records, and a faster speed.

Following is an example of an int8 array.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store