Efficiently Implementing Full-table and Full-field Fuzzy Search in Milliseconds with PostgreSQL

By Digoal

Background

In some applications, you may need to search all the fields in a table such as in PostgreSQL. Some fields may require precise search, while others may require a fuzzy search or full-text search. For example, the selection of drop-down boxes on various front-end pages. This might be a headache for application developers as writing SQL statements is a hassle.

The previous article describes the full-text search scenario. But, what if the user wants to perform a fuzzy search? You can refer to the following series of articles I have previously written about the fuzzy search scenarios:

Now, the question is how to perform fuzzy searches on all the fields in an entire table? The answer is pg_trgm. It is a key technology to efficiently implement full-table and full-field fuzzy search.

Implementation Example of Full-table and Full-field Fuzzy Search

A page is designed at the front-end to allow users to perform fuzzy searches, but its corresponding table has several fields, and the search range is usually all fields.

From the user’s perspective, the experience is good. However, it is a headache for programmers as they do not know which field(s) the user wants to search for. So, the burning question is how can we achieve efficient matching?

To begin with, create a test table, and generate test data.

It is critical to note that if the field required to be searched contains Chinese characters or other multi-bytes characters, the database with Collate and Ctype=C cannot be used.

Fortunately, Alibaba Cloud RDS PostgreSQL is not “C” by default, which is great!

Next, create a function index that supports fuzzy searches.

Now, implement a search test.

Also, test the search performance as shown below.

High search speed.

Also, run a test to search for rows containing Andy Lau (GIN index is recommended because few rows are available).

Statement Timeout

Generally, once the index hits results, the response time varies from a fraction of a millisecond to tens of milliseconds based on the number of result sets returned.

However, in some cases, when the amount of information that a user inputs are too small, a slowdown may occur. For example, if a user inputs 2 characters, it generates numerous matched tokens resulting in a slowdown.

You can mitigate this challenge by using GiST. The application layer provides protection. For example, if the response time exceeds 1 second, the statement timeout will be reported.

Use of Hint

The following are a few simple rules to keep in mind while running queries.

  • When result sets are returned with cursors: Use GiST
  • When less than 3 characters are entered: Use GiST
  • When a few rows are evaluated: Use GIN

In all other cases, use GIN.

With the above-listed rules, you can force the use of a specific index through Hint.

Other Optimization

You can also make some optimizations at the service level. For example, you can use the full-text search first, and in case, no matches are found then use the fuzzy search.

Read-only Instances

According to the previous tests, the response of a search should be within 1 ms.

For a 32-core computer, the QPS that this fuzzy search can achieve is estimated to be around 80,000. If you find that a single node no longer meets the concurrency of the search even when it is optimized, you can build a read-only instance. It is simple to build a read-only instance. For further information, please visit this page

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.