How to Retrieve and Deduplicate Similar Text in PostgreSQL

Image for post
Image for post

Background

This article seeks to address the following list of questions raised in the Q&A section of the Alibaba Cloud Community.

How can I delete similar data from a table?

For example, similar strings such as the following:

  • How to do it?
  • How to perform it?
  • How can I do it?
  • What should I do in this case?

How to Solve the Problem

Follow the steps below to resolve the issues listed above.

1) Use the pg_trgm word breaker provided by PostgreSQL to divide strings into multiple non-repeated tokens and determine the content similarity between two strings. We recommend using the word breaking method to divide the content into terms.
2) After dividing the content into terms, aggregate the terms and remove duplicate data.
3) Generate a Cartesian product (matrix) through self-association to calculate the similarity between each record and each other record. To calculate the similarity, simply divide the number of overlapping tokens by the number of tokens in the set after deduplication.
4) Remove unnecessary data based on similarity.

While dealing with huge data volumes, use a professional analytical programming language, such as PL/R.

Practical Example

First, install the PostgreSQL word breaker (AliCloudDB for PostgreSQL contains this plug-in. For more information, see the official manual.)

Create a plug-in as shown below.

Next, create a test case.

Now, choose one of the word breakers as shown below.

Create the following three functions:

  • To calculate the set (after deduplication) of two arrays
  • To deduplicate the arrays
  • To calculate the overlapping parts (after deduplication) of the two arrays

The Cartesian result is as follows:

regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ') used to convert info into arrays

The preceding result is actually a matrix, where simulate indicates the similarity to be calculated.

Image for post
Image for post

During deduplication, we only need the data in the upper or lower part of the diagonal of the matrix. Therefore, add one more condition to determine the similarity.

First, specify the simulate. For example, data with a similarity greater than 0.5 needs to be deduplicated.

Second, delete the records corresponding to the IDs in column t2c1.

Use a database to solve the problems that you may encounter because the JOIN filter is <> and <, while the hashjoin is not used.

This operation usually takes longer for large amounts of data.

A better way is to use PL/R or R for matrix calculation and then filtering the result. You can also use an MPP database. For example, Greenplum processes a large amount of data with R and MADlib.

Summary

The article illustrates how to retrieve and deduplicate similar text in PostgreSQL by using examples. It applies Word Breaking functionality of PG to answer the questions regarding the deduplication of similar text. The other feature of PG, Window Query, is not used in this example. However, if your data does not have a primary key, use ctid and row_number to locate a unique record.

References

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store