How to Optimize Duplicate Data Cleansing in PostgreSQL

Background

Duplicate data cleansing is a common business demand. For example, some databases do not support unique constraints, or unique constraints on certain columns were not considered during program design, which leads to duplicate data after the application has been running for some time. This creates the demand for duplicate data cleansing.

It’s critical to determine which cleansing approaches are efficient. Even a small application scenario like this one touches on 10 technical database issues. Let’s consider each issue in detail in the following sections.

Duplicate Data Cleansing Techniques

For example, assume there is a table in which a combination of fields should be unique but contains duplicate values. The objective is to keep one row from each group of duplicates and remove the others.

Example

Remove duplicate (sid + crt_time) combinations and keep the row with the largest mdf_time in each group of duplicates. One million records of test data are generated with a 1/10 duplication rate.

A duplicate value is generated after every 500 records of data to avoid duplicate data in the same data block. The following SQL statement helps to accomplish this.
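
The original statements are not reproduced here, so the following is a minimal sketch under assumed names: a table called test with the three fields from the example, filled with one million rows in which roughly one row in ten duplicates the (sid, crt_time) combination of a row written 500 records earlier.

    -- Assumed table layout; the exact generator used in the original test may differ.
    create table test (
      sid      int,
      crt_time timestamp,
      mdf_time timestamp
    );

    -- 1,000,000 rows; about 10% of them repeat the (sid, crt_time) pair of a row
    -- 500 records earlier, so the duplicates land in different data blocks.
    insert into test
    select
      case when i % 1000 < 100 and i > 500 then i - 500 else i end as sid,
      timestamp '2020-01-01 00:00:00' as crt_time,  -- constant, so duplicates collide on (sid, crt_time)
      clock_timestamp()               as mdf_time   -- advances per row, so "largest mdf_time" is well defined
    from generate_series(1, 1000000) as g(i);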

Verify the ctid to make sure duplicate data is not in the same data block.
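
For instance, a check along these lines (table name as assumed above) lists a few duplicate groups together with the physical location of each row:

    -- ctid is the physical (block, offset) address of a row; duplicates should
    -- show different block numbers.
    select sid, crt_time, ctid
    from test
    where (sid, crt_time) in (
      select sid, crt_time
      from test
      group by sid, crt_time
      having count(*) > 1
    )
    order by sid, crt_time
    limit 10;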

Use a window query as the verification method.
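
A sketch of such a verification query: count the rows that are not the newest row of their (sid, crt_time) group.

    select count(*)
    from (
      select ctid,
             row_number() over (partition by sid, crt_time
                                order by mdf_time desc) as rn
      from test
    ) t
    where rn <> 1;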

The output of this query shows that there are many duplicate values.

Next, let’s remove the duplicates.

Method 1) Insert

Insert the deduplication results into a new table. This takes 5.8 seconds.
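
A sketch of this method, with test_new as an assumed name for the target table:

    create table test_new (like test);

    -- Keep only the newest row (largest mdf_time) of each (sid, crt_time) group.
    insert into test_new
    select sid, crt_time, mdf_time
    from (
      select *,
             row_number() over (partition by sid, crt_time
                                order by mdf_time desc) as rn
      from test
    ) t
    where rn = 1;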

Check what optimization is possible. According to the execution plan, the sort step is the main candidate for optimization.
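
A sketch of how that plan can be inspected: without a suitable index, the plan contains an explicit Sort node feeding the window function.

    explain (analyze, buffers)
    select *,
           row_number() over (partition by sid, crt_time
                              order by mdf_time desc) as rn
    from test;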

Optimization 1

Eliminate the sort by using an index.
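
A sketch of the idea, with the index name as an assumption: a composite index matching the window's partition and sort keys lets the planner read rows already in order instead of sorting them.

    create index idx_test_dedup on test (sid, crt_time, mdf_time desc);

    truncate test_new;  -- start from an empty target before re-running the test

    -- The same INSERT ... SELECT as before, now without an explicit Sort.
    insert into test_new
    select sid, crt_time, mdf_time
    from (
      select *,
             row_number() over (partition by sid, crt_time
                                order by mdf_time desc) as rn
      from test
    ) t
    where rn = 1;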

After optimization, the process only takes 3.9 seconds.

For online businesses, PostgreSQL allows using CONCURRENTLY to create indexes without blocking DML operations.
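
For example, the same index can be built without blocking concurrent writes:

    -- Use this form instead of the plain CREATE INDEX when the table is busy.
    create index concurrently idx_test_dedup on test (sid, crt_time, mdf_time desc);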

Optimization 2

Recursive query and convergence.

In certain cases, this method improves performance hundreds of times over. Use the following method to achieve better results when there are many duplicates.
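
The original statement is not shown here, so the following is a sketch of the recursive "skip scan" idea; it relies on the composite index created above and writes into the assumed test_new table. Each recursion step jumps straight to the next (sid, crt_time) group and keeps only that group's newest row, so the cost grows with the number of distinct groups rather than the number of rows.

    truncate test_new;  -- start from an empty target

    with recursive skip as (
      (
        -- seed: the newest row of the very first (sid, crt_time) group
        select t as rec
        from test t
        order by t.sid, t.crt_time, t.mdf_time desc
        limit 1
      )
      union all
      (
        -- step: via the index, jump to the newest row of the next group
        select (
          select t
          from test t
          where (t.sid, t.crt_time) > ((s.rec).sid, (s.rec).crt_time)
          order by t.sid, t.crt_time, t.mdf_time desc
          limit 1
        )
        from skip s
        where (s.rec).sid is not null
      )
    )
    insert into test_new
    select (rec).sid, (rec).crt_time, (rec).mdf_time
    from skip
    where (rec).sid is not null;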

Use it as shown below when the table has a unique key (UK).
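
A hypothetical variant, assuming the table additionally carries a unique key column id: the recursion collects the id of the row to keep in each group, and every other row is deleted.

    with recursive skip as (
      (
        select t as rec
        from test t
        order by t.sid, t.crt_time, t.mdf_time desc
        limit 1
      )
      union all
      (
        select (
          select t
          from test t
          where (t.sid, t.crt_time) > ((s.rec).sid, (s.rec).crt_time)
          order by t.sid, t.crt_time, t.mdf_time desc
          limit 1
        )
        from skip s
        where (s.rec).sid is not null
      )
    )
    -- Keep only the rows whose unique key was collected by the recursion.
    delete from test
    where id not in (
      select (rec).id from skip where (rec).id is not null
    );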

Method 2) Delete

Add a row_number field while importing the data into the table to be processed, and create a partial index with WHERE row_number <> 1.

Then delete these marked rows. The requirement is fulfilled within two seconds.
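
One possible shape of this method (table and column names are assumptions): the working table carries a row_number column filled by a window function as the data is loaded, and a partial index covers only the rows marked as duplicates.

    -- Load into the working table, marking each row's position within its group.
    create unlogged table test_stage as
    select *,
           row_number() over (partition by sid, crt_time
                              order by mdf_time desc) as rn
    from test;

    -- The partial index covers only the ~10% of rows that are duplicates.
    create index idx_stage_dup on test_stage (rn) where rn <> 1;

    -- Deleting them touches only the rows reachable through the partial index.
    delete from test_stage where rn <> 1;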

Verification
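
Presumably this step repeats the earlier duplicate check, now expecting an empty result; a sketch against the assumed working table:

    select sid, crt_time, count(*)
    from test_stage
    group by sid, crt_time
    having count(*) > 1;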

Fast Method 1

If the duplicate data comes from a text file, deduplicate it while importing it into the database, and then export the result.

First, consider file external tables (file_fdw) and then the COPY interface. Refer to the PostgreSQL documentation for more details.
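
A minimal sketch of the external-table route, assuming the raw data sits in /tmp/dup.csv; the server, table, and file names here are assumptions rather than the original setup.

    create extension if not exists file_fdw;
    create server file_srv foreign data wrapper file_fdw;

    create foreign table dup_file (
      sid      int,
      crt_time timestamp,
      mdf_time timestamp
    ) server file_srv
      options (filename '/tmp/dup.csv', format 'csv');

    -- Deduplicate straight out of the file and export the result via COPY.
    copy (
      select distinct on (sid, crt_time) sid, crt_time, mdf_time
      from dup_file
      order by sid, crt_time, mdf_time desc
    ) to '/tmp/dedup.csv' with (format csv);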

The speed is still not fast enough, so let's try an optimization method.

Optimization of Concurrent Processing

If the file is split into multiple files for concurrent processing, the required time reduces to around 800 milliseconds. But this is not the end: a merge sort is still needed for global deduplication.
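
A rough sketch of that layout, assuming the input was split into three fragments (for example with the Unix split command) and exposed through the file_fdw server from the previous example; every file and table name here is an assumption. Each fragment can be deduplicated by its own session in parallel, and the final pass merges the per-fragment results to remove duplicates that span fragments.

    create foreign table dup_part_1 (sid int, crt_time timestamp, mdf_time timestamp)
      server file_srv options (filename '/tmp/dup_part_aa', format 'csv');
    create foreign table dup_part_2 (sid int, crt_time timestamp, mdf_time timestamp)
      server file_srv options (filename '/tmp/dup_part_ab', format 'csv');
    create foreign table dup_part_3 (sid int, crt_time timestamp, mdf_time timestamp)
      server file_srv options (filename '/tmp/dup_part_ac', format 'csv');

    -- Global pass: merge the fragments and keep the newest row per group.
    copy (
      select distinct on (sid, crt_time) sid, crt_time, mdf_time
      from (
        select * from dup_part_1
        union all
        select * from dup_part_2
        union all
        select * from dup_part_3
      ) t
      order by sid, crt_time, mdf_time desc
    ) to '/tmp/dedup_all.csv' with (format csv);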

With this acceleration, the process completes in about 1 second. Increasing the concurrency further reduces the total time to around 200 milliseconds.

Even so, concurrency alone is not enough.

Let’s look at another method.

Fast Method 2

Import the data concurrently into a single table for processing and then export it, without saving the intermediate results. This implies using an UNLOGGED TABLE.

It takes 5.28 seconds.
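
A minimal sketch of this method under assumed names: land the raw data in an unlogged table, deduplicate it, and export the result, keeping no intermediate data that needs WAL protection.

    create unlogged table raw_stage (
      sid      int,
      crt_time timestamp,
      mdf_time timestamp
    );

    copy raw_stage from '/tmp/dup.csv' with (format csv);

    copy (
      select distinct on (sid, crt_time) sid, crt_time, mdf_time
      from raw_stage
      order by sid, crt_time, mdf_time desc
    ) to '/tmp/dedup.csv' with (format csv);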

Concurrent Method, Stream Processing — Event Processing

This allows writing data into streams concurrently.

How to Clear Duplicate Rows without a Unique ID?

Delete rows by physical row numbers as shown below.
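
The original statement is not reproduced here; a common pattern for this uses ctid, the physical row address, to keep only the newest row in each (sid, crt_time) group:

    delete from test
    where ctid in (
      select ctid
      from (
        select ctid,
               row_number() over (partition by sid, crt_time
                                  order by mdf_time desc) as rn
        from test
      ) t
      where rn <> 1
    );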

Optimization Methods of Duplicate Data Cleansing — Technical Issues

The preceding sections have already introduced different optimization methods. Now let’s review each technique.

1) Window Query

The window query function is used to filter and mark duplicate values. Start by creating a composite index with the fields to be deduplicated as the partition (window) key and the rule field as the sort key.

2) External Table

For cases where the data comes from text, adopt the fast method to deduplicate it. Use the database as a text processing platform: access the file through PostgreSQL's file_fdw external table and remove duplicates in SQL.

3) Parallel Computing

Also, when the data comes from text, split the text into several small files and use external tables for concurrent deduplication. After the per-file pass, perform a merge sort for global deduplication. PostgreSQL 9.6 supports the use of multiple CPU cores to process a single query, which can improve performance almost linearly. (Merge sort should be considered for the global deduplication step.)
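
For example, the number of parallel workers a single query may use is controlled per session; the value below is just an illustration.

    -- PostgreSQL 9.6 and later.
    set max_parallel_workers_per_gather = 4;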

4) Recursive Query and Convergence

Recursive queries deliver a significant improvement in scenarios with high repeatability, improving performance by dozens to hundreds of times.

5) Insert on Conflict

This is a new feature of PostgreSQL 9.5, which completes deduplication upon data import. The deduplication results are exported directly.

It also supports concurrent loading. (Note that a single query cannot update the same data record more than once, so the input batch itself must be deduplicated first.)
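
A hypothetical sketch of this pattern, assuming a target table with a unique constraint on (sid, crt_time); the table name and constraint are assumptions. The inner DISTINCT ON removes duplicates within the batch itself, so no record is updated twice, while ON CONFLICT handles rows that already exist in the target, for example because another session loaded them concurrently.

    create table test_dedup (
      sid      int,
      crt_time timestamp,
      mdf_time timestamp,
      unique (sid, crt_time)
    );

    insert into test_dedup
    select distinct on (sid, crt_time) sid, crt_time, mdf_time
    from test
    order by sid, crt_time, mdf_time desc
    on conflict (sid, crt_time)
    do update set mdf_time = excluded.mdf_time
    where test_dedup.mdf_time < excluded.mdf_time;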

6) LLVM

LLVM reduces context switching in multi-row processing, which can roughly double performance.

7) Stream Computing

It allows streaming deduplication as data is imported. Refer to the related article on stream computing for more details.

8) Concurrent Index Creation

Concurrently create indexes to avoid blocking DML operations. Increasing maintenance_work_mem during index creation accelerates the build.
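
For instance (the value and index name are illustrative only):

    set maintenance_work_mem = '2GB';
    create index concurrently idx_test_dedup2 on test (sid, crt_time, mdf_time desc);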

9) Concurrent Reading of File Fragments for Import

Split the file and import the fragments concurrently to speed up the import. In the future, concurrent import of file fragments may also be achieved through external access interfaces such as file_fdw.

10) Bulk Load Without Logging

Bulk imports or unlogged tables apply when the database is only used for computation, that is, when the intermediate processing results do not need to be retained. This improves the import speed, and autovacuum is disabled during the import.

Summary

The following two points summarize the article effectively:

  1. If data already exists in the database, deleting the duplicate data in the original table takes around 2 seconds.
  2. If data is imported from text and then exported after deduplication, the process takes around 5.28 seconds.
