PostgreSQL Similarity Search: Millisecond-level Similarity Search for Addresses, QA, POIs, and Other Text


Background

In real life, similarity search is used in numerous scenarios, such as:

  • When you hire a taxi, the destination you enter may not exactly match the one stored in the database. Only a similarity search can map your input to a specific destination.
  • Similarly, when you search for questions or problems, your input may differ from the existing entries. Only a similarity search can find the matches.
  • Similarity search is also used to look up points of interest (POIs) and more.

PostgreSQL supports similarity search, including similarity search for images, arrays, and text. For text, you can use the pg_trgm plug-in to implement similarity search. In an earlier test, pg_trgm handled regular-expression and fuzzy queries over 10 billion strings (32 random characters per row) in milliseconds.

Similarity queries use the same plug-in and index. This article runs similarity searches over random Chinese strings to see how a standalone PostgreSQL instance performs.

Build Test Sample Data

Let’s take a look at the following steps required to carry out the test on similarity searches:

Step 1. Create a function to generate random Chinese characters.
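The original function was not reproduced here, but it can be sketched as follows: it builds a string of n random characters drawn from the CJK Unified Ideographs range U+4E00–U+9FA5 (code points 19968–40869).

```sql
-- Returns a string of n random Chinese characters.
-- CJK Unified Ideographs span code points 19968 (U+4E00) to 40869 (U+9FA5).
create or replace function gen_hanzi(n int) returns text as $$
declare
  res text;
begin
  if n >= 1 then
    select string_agg(chr(19968 + (random() * 20901)::int), '')
      into res
      from generate_series(1, n);
    return res;
  end if;
  return null;
end;
$$ language plpgsql strict;
```

For example, `select gen_hanzi(64);` returns one 64-character random Chinese string.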

Step 2. Use partition tables to improve write and query performance.

Note: The preceding partitioning method is shown for reference and is not specific to the tests in this article.

Step 3. Now, create a parent table. To accelerate the import, an unlogged table is used here. Do not use an unlogged table if the data cannot be regenerated, because its contents are lost after a crash.
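A minimal sketch of the parent table, assuming PostgreSQL 11 hash partitioning and hypothetical names `tbl`, `id`, and `info`:

```sql
-- Unlogged tables skip WAL writes during the bulk load, which speeds up
-- the import, but their contents are lost after a crash.
create unlogged table tbl (
  id   int8,
  info text
) partition by hash (id);
```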

Step 4. Create 64 sub-tables.
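Continuing the sketch above (hypothetical names `tbl` and `tbl_p0` … `tbl_p63`), the 64 hash partitions can be generated in a loop:

```sql
do language plpgsql $$
begin
  for i in 0..63 loop
    execute format(
      'create unlogged table tbl_p%s partition of tbl
         for values with (modulus 64, remainder %s)',
      i, i);
  end loop;
end;
$$;
```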

Step 5. Next, write 1 billion pieces of test data into the partition. The quick writing method is as follows. It uses DBLink to asynchronously call parallel loading.
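The loading method can be sketched as follows, again using the hypothetical `tbl` and `gen_hanzi` names: `dblink_send_query` returns immediately, so 64 sessions insert 15,625,000 rows each in parallel.

```sql
create extension if not exists dblink;

do language plpgsql $$
begin
  for i in 0..63 loop
    -- One connection per loader; the async call returns at once,
    -- so all 64 inserts run concurrently.
    perform dblink_connect(format('conn%s', i), 'dbname=postgres');
    perform dblink_send_query(format('conn%s', i), format(
      'insert into tbl
         select %s::int8 * 15625000 + id, gen_hanzi(64)
           from generate_series(1, 15625000) id',
      i));
  end loop;
end;
$$;
```

Call `dblink_get_result` on each connection afterward to confirm the loads finished.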

There are 64 partitions, and each row contains 64 random Chinese characters. 15,625,000 rows are written into each partition, so 1 billion rows are inserted in total.

This parallel writing method keeps all CPU cores busy and writes the 1 billion random texts at high speed.

Once the writing is complete, there are 1 billion records. The table occupies 223 GB of space, and writing takes 18 minutes. The sample is as follows.

Step 6. Create the index after loading to speed up data generation. In actual production, you can also create the index first.
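With the hypothetical names above, the trigram index might look like this; in PostgreSQL 11, an index created on the partitioned parent is cascaded to every partition.

```sql
create extension if not exists pg_trgm;

-- GIN inverted index over trigrams; one statement indexes all 64 partitions.
create index idx_tbl_info on tbl using gin (info gin_trgm_ops);
```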

There are 1 billion records. The GIN inverted index occupies 332 GB of space and takes nearly 180 minutes to create.

The index creation speed is explained as follows.

The strings written here are completely random Chinese characters drawn from tens of thousands of distinct characters, so the GIN inverted index contains far more tokens than it would for normal production data; a typical index is smaller and builds faster. Even so, a GIN index builds much more slowly than a BTREE index due to its structure: a BTREE index on this table can be created within 5 minutes.

SQL Usage for Similarity Queries

Follow the steps listed below to implement SQL for similarity queries:

Step 1. View the current similarity threshold.

Step 2. Set similarity thresholds for the current session. For other settings, refer to the end of the section.
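The two steps above use pg_trgm's helper functions; `set_limit()` is equivalent to setting the `pg_trgm.similarity_threshold` parameter for the session.

```sql
-- View the current similarity threshold (the default is 0.3).
select show_limit();

-- Set the threshold for the current session; equivalent to
-- set pg_trgm.similarity_threshold = 0.9;
select set_limit(0.9);
```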

Step 3. The response speed of a similarity search depends on the threshold set by the user. The lower the threshold, the more rows match and the slower the response; the higher the threshold (that is, the higher the precision), the faster the response.

Step 4. Now, query the similarity text based on input text and get the output in a similar order.
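A sketch of the query, with a hypothetical placeholder standing in for the 64-character input string: the `%` operator filters rows above the similarity threshold, and ordering by the `<->` distance operator returns the closest matches first.

```sql
select info,
       similarity(info, '...your 64-character query...') as sml
  from tbl
 where info % '...your 64-character query...'
 order by info <-> '...your 64-character query...'
 limit 10;
```

Prefix the statement with `explain (analyze)` to inspect the plan used in the next step.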

SQL Time: 71 ms.

Step 5. The SQL execution plan is as follows. Execution is very fast because an index scan is used.

Step 6. Now, force parallel execution. PostgreSQL 11 supports parallel scans across multiple partitions, which speeds up queries.
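One way to encourage parallel plans for the session is to lower the parallel cost parameters, a sketch:

```sql
-- Make the planner prefer parallel append across the 64 partitions (PG 11).
set enable_parallel_append = on;
set max_parallel_workers_per_gather = 16;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set min_parallel_index_scan_size = 0;
```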

With parallelism, the SQL execution time drops to 40 milliseconds. Since the query was already fast, parallelism changes little here; when the computation workload is large, the improvement is significant.

Performance Stress Testing for Similarity Queries

Create a stress test function to randomly extract a string from an existing record by using the primary key (and process it to generate a new string with a certain similarity).

Use substring(info,1,28)||gen_hanzi(4)||substring(info,29,28) to generate a new string with a similarity of 0.75.

Implement the stress test function as follows.
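The function was not reproduced here; a sketch consistent with the formula above (and the hypothetical `tbl`/`gen_hanzi` names) is:

```sql
-- Picks a random existing row by primary key and splices 4 new random
-- characters into the middle of its first 56 characters, yielding a
-- string with a similarity of about 0.75 to the original value.
create or replace function get_qry() returns text as $$
declare
  v text;
begin
  select info into v
    from tbl
   where id = 1 + (random() * 999999999)::int8;
  return substring(v, 1, 28) || gen_hanzi(4) || substring(v, 29, 28);
end;
$$ language plpgsql strict;
```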

Next, run the query test.

Stress test script is as follows.
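A sketch of the pgbench script, assuming the hypothetical `get_qry()` helper above:

```sql
-- test.sql, driven by pgbench, e.g.:
--   pgbench -M prepared -n -r -f ./test.sql -c 56 -j 56 -T 120
select count(*) from
  (select info from tbl where info % get_qry() limit 10) t;
```

The `-c`/`-j` values should match the number of cores on the test machine.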

Implement the stress test as shown below.

Performance bottleneck analysis is as follows.

The CPU is fully loaded, and I/O is high.

CPU consumption is mainly caused by the bitmap scan, whose tuple recheck runs on the CPU.

I/O consumption is mainly caused by the large volume of I/O access: the data plus index size is close to 600 GB, which exceeds the memory size.

Conclusion

Data Structure Performance Indicators

The data structure performance indicators are as follows:

  • Time for writing 1 billion pieces of text data: 18 minutes.
  • GIN index creation time: 180 minutes.

Space Occupied

  • 1 billion text records: 223 GB
  • Index: 332 GB

Performance Indicators

The following table shows the respective performance indicators:


The CPU is fully loaded, and I/O is high. CPU consumption is mainly caused by the bitmap scan, whose tuple recheck runs on the CPU.

I/O consumption is mainly caused by the large volume of I/O access: the data plus index size is close to 600 GB, which exceeds the memory size.

Summary

1. PostgreSQL 11 parallel append significantly improves performance.

A similarity search of 1 billion random Chinese strings (64 characters in length) takes only XXX seconds.

2. The benefits of using partition tables in this case are as follows.

2.1. Improved parallelism of data writing
2.2. Faster index creation
2.3. Faster index maintenance

Meanwhile, PostgreSQL 11 supports parallel scanning of multiple partitions (enable_parallel_append) and parallel scans combined with merge sort. Therefore, similarity search over massive data sets performs well even when the computation workload is large.

Before PostgreSQL 11, you can use DBLink asynchronous calling to support the parallel scanning of multiple partitions.

3. The threshold is a key point in similarity search. You can use show_limit() to view the current similarity threshold and set_limit() to set it. The greater the similarity value, the higher the matching degree; a value of 1 indicates an exact match.

4. You can use similarity() or word_similarity() to compute the similarity of two strings.

For more information about similarity algorithms, check this document.

5. The response speed of a similarity search depends on the threshold set by the user. The lower the threshold, the more rows match and the slower the response; the higher the threshold (that is, the higher the precision), the faster the response.

In actual production, start the search with a high limit and gradually lower it until enough matches are found; this keeps response times fast. This logic can be encapsulated in a UDF that users call to search. Let's check the following example:
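Such a UDF can be sketched as follows (hypothetical names `sim_search`/`tbl`): it starts strict at 0.9 and relaxes the limit step by step until rows are found.

```sql
create or replace function sim_search(q text) returns setof text as $$
declare
  lim  real := 0.9;
  hits int;
begin
  loop
    -- Tighten or relax the trigram threshold, then probe the index.
    perform set_limit(lim);
    return query
      select info from tbl
       where info % q
       order by info <-> q
       limit 10;
    get diagnostics hits = row_count;
    exit when hits > 0 or lim <= 0.1;
    lim := lim - 0.1;
  end loop;
end;
$$ language plpgsql strict;
```

Call it as `select * from sim_search('...query...');`; the first iteration that matches returns immediately, so high-precision hits stay fast.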

UDF helps to search in a faster and more accurate manner.

6. To set the threshold at the PostgreSQL instance, database, or user level, use the following two parameters.

These parameters act on different operators and functions respectively; see this article for more information.

The following snapshot shows a configuration example.
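A sketch of such a configuration, assuming pg_trgm's two threshold parameters and a hypothetical role "appuser":

```sql
-- pg_trgm.similarity_threshold governs the % operator;
-- pg_trgm.word_similarity_threshold governs <% and %>.

-- Database-level setting:
alter database postgres set pg_trgm.similarity_threshold = 0.9;

-- User-level setting (hypothetical role):
alter role appuser set pg_trgm.word_similarity_threshold = 0.6;

-- Instance-level: add the parameters to postgresql.conf and reload.
```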

UDF Segmentation for the Similarity of smlar Arrays: An Example

Step 1. Prepare plug-ins.
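smlar is a third-party extension and is not bundled with PostgreSQL; it must be compiled from source before it can be enabled.

```sql
-- Requires the smlar shared library to be built and installed first.
create extension smlar;
```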

Step 2. Test the table using the command below.
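A minimal test table for array similarity, with hypothetical names `t_arr`/`arr`:

```sql
create table t_arr (
  id  serial primary key,
  arr int[]
);
```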

Step 3. Create a function to generate random arrays.
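The generator can be sketched as a simple SQL function (hypothetical name `gen_rand_arr`):

```sql
-- Returns an array of len random integers in [0, maxval].
create or replace function gen_rand_arr(len int, maxval int) returns int[] as $$
  select array(
    select (random() * maxval)::int
      from generate_series(1, len)
  );
$$ language sql strict;
```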

Step 4. Now write test data as shown below.
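Using the hypothetical helpers above, for example:

```sql
-- 1 million rows, each a 16-element array of values in [0, 1000].
insert into t_arr (arr)
select gen_rand_arr(16, 1000)
  from generate_series(1, 1000000);
```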

Step 5. Create an index for an array similarity query.
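smlar ships GIN operator classes per element type; for int4 arrays the index might look like:

```sql
-- _int4_sml_ops is smlar's GIN operator class for int4[].
create index idx_t_arr on t_arr using gin (arr _int4_sml_ops);
```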

Step 6. Next, create a UDF as shown below.
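The UDF can follow the same strict-to-loose strategy as the text search, lowering smlar's threshold step by step until matches appear (hypothetical name `arr_search`):

```sql
create or replace function arr_search(q int[]) returns setof int[] as $$
declare
  lim  numeric := 0.9;
  hits int;
begin
  loop
    -- Session-local threshold for smlar's % operator.
    perform set_config('smlar.threshold', lim::text, true);
    return query select arr from t_arr where arr % q limit 10;
    get diagnostics hits = row_count;
    exit when hits > 0 or lim <= 0.3;
    lim := lim - 0.1;
  end loop;
end;
$$ language plpgsql strict;
```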

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.
