PostgreSQL Fuzzy Search Best Practices: Single-word, Double-word, and Multi-word Fuzzy Search Methods

By Digoal

Background

Prefix fuzzy searches, suffix fuzzy searches, fully fuzzy searches (prefix/suffix-free fuzzy searches), and regex matches are common text search requirements.

PostgreSQL offers powerful text search capabilities and efficiently supports full-text search, fuzzy search, and regex search. The built-in pg_trgm plug-in that facilitates such searches is not available in general databases. Expression indexes and GIN indexes are also built-in.

Different fuzzy search requirements have various optimization methods. Like other databases, PostgreSQL uses a B-tree to accelerate prefix fuzzy searches and suffix fuzzy searches. The function index of the reverse function also helps to accelerate suffix fuzzy searches.

PostgreSQL uses the pg_trgm plug-in to accelerate fully fuzzy searches and regex matches using GIN indexes (this is very effective for fuzzy searches with 3 or more characters input). Further, PostgreSQL also uses custom GIN expression indexes to accelerate custom fuzzy searches.

I. Optimization for Prefix and Suffix Fuzzy Searches

1. Optimization for Prefix Fuzzy Searches: B-tree supports prefix fuzzy searches

1.1 This method is only applicable for searches with COLLATE =”C” when the default “index ops class” is used. You must specify COLLATE “C” for indexes and search criteria under the condition lc_collate<>C.

Indexes can be used only when the COLLATE value is the same for indexes and search criteria.

Example.

1.2 Another way is available for the B-tree index to support fuzzy searches when the database uses lc_collate<>C by default. Using the corresponding “pattern ops”, the character search method will be used instead of the binary search method.

Refer to this document for the following explanations.

Example.

In case, you use “pattern ops”, index search supports the syntax of LIKE and regular expressions, as follows.

2. Optimization for Suffix Fuzzy Searches: Use the reverse index to support suffix fuzzy searches.

2.1 This method is only applicable for searches with COLLATE =”C” when the default “index ops class” is used. You must specify COLLATE “C” for indexes and search criteria under the condition lc_collate<>C.

Indexes can be used only when the COLLATE value is the same for indexes and search criteria.

Example.

2.2 When the database uses lc_collate<>C by default, another way is available for the B-tree index to support fuzzy searches. Using the corresponding “pattern ops”, the character search method will be used instead of the binary search method.

When you use “pattern ops”, index search supports the syntax of LIKE and regular expressions.

Example.

3. Optimization for Both- Prefix and Suffix Fuzzy Searches: Use the pg_trgm index to support prefix and suffix fuzzy searches.

Note: To achieve a good index filtering effect, enter at least 1 character for prefix fuzzy searches and at least 2 characters for suffix fuzzy searches.

To efficiently support multi-byte characters (such as Chinese characters), ensure that the lc_ctype of the database is not “C”. The effect will be OK only if the token is correctly split. Set lc_ctype correctly, to correctly split the characters in multi-byte strings one by one.

Indexes can be used only when the COLLATE value is the same for indexes and search criteria.

Use the function for generating random Chinese strings as shown below.

Now, generate random data.

Next, implement the fuzzy search.

II. Optimization for Fully Fuzzy Searches

You can use the pg_trgm plug-in to support fully fuzzy searches.

Note: To enable pg_trgm efficiently support multi-byte characters (such as Chinese characters), the lc_ctype of the database cannot be “C”. The effect will be OK only if the token is correctly split. Only when lc_ctype is set correctly, the characters in multi-byte strings are correctly split one by one.

It is recommended to enter 3 or more characters. Otherwise, the effect will not be good (we will analyze the cause later).

Example.

III. Optimization for Regex Matches

The syntax for the PostgreSQL regex match is字符串 ~ ‘pattern’ or 字符串 ~* ‘pattern’. For more information about functions matching, check out this document.

Example.

Refer to contrib/pg_trgm/trgm_regexp.c for the principle of the regex match index.

Principle of the pg_trgm Fuzzy Search

First, pg_trgm adds two spaces before the string and one space at the end of the string. Then, the characters are split in a way that each 3 consecutive characters are a token. Finally, GIN reverse indexes are created for the tokens. To view the tokens of a string, you can use the following method.

Following are the reasons why the number of characters for pg_trgm prefix or suffix fuzzy searches, and fully fuzzy searches need to meet the requirements.

  • For prefix fuzzy searches (such as a%), at least 1 character must be provided. (The search is for results that meet the condition that token=’a ‘)
  • For suffix fuzzy searches (such as %ab), at least 2 characters must be provided. (The search is for results that meet the condition that token=’ ab’)
  • For fully fuzzy searches (such as %abcd%), at least 3 characters must be provided. (In this example, the search is for results that meet the condition that the token(s) contains {“a”,”ab”,abc,bcd,”cd “}).

It is necessary to abide by these conditions because the token generated by pg_trgm has three characters, and a match can be found only when the preceding three conditions are met.

IV. Optimization of Fuzzy Searches With Less Than 3 Input Characters

pg_trgm does not apply when the fully fuzzy search is based on 1 or 2 characters. In this case, you can use expression and GIN indexes.

Firstly, use an expression to split the string into one word and two arrays containing consecutive characters and then create GIN indexes for the arrays.

Example.

V. Optimization for Similarity Searches

Fuzzy searches and regex matches aim to find the records that completely match the search criteria. A third search type is the similarity search. Consider an example, for the PostgreSQL string, even if p0stgresgl is entered, it can still be matched according to the similarity. Use the pg_trgm plug-in here. To support Chinese characters, the following requirements are also applied.

To enable pg_trgm to support similar searches for Chinese characters, the lc_ctype of the database cannot be “C”. The effect will be OK only if the token is correctly split. Only when lc_ctype is set correctly, can the characters in multi-byte strings be correctly split one by one. (Character classification [What is a letter? Its upper-case equivalent?]).

It is recommended to enter 3 or more characters. Otherwise, the effect will not be good (the cause will be analyzed later).

Example.

VI. Summary

1. If you only need prefix fuzzy searches (the string LIKE “xx%”), use the B-tree index with collate=”C”. If collate is not “C”, you can use the corresponding “pattern ops” (for example, text_pattern_ops) to create a B-tree index.

2. If you only need suffix fuzzy searches (the string LIKE ‘“%abc” is equivalent to the reverse string LIKE “cba%”), use the B-tree index of reverse() expression with collate=”C”. If collate is not “C”, you can use the corresponding “pattern ops” (for example, text_pattern_ops) to create a B-tree index.

3. If you need fully fuzzy searches that contain Chinese characters, use the database with lc_ctype <> “C”, together with the GIN index of the pg_trgm plug-in. (The effect will be OK only if the token is correctly split. Only when lc_ctype is set correctly, can the characters in multi-byte strings be correctly split one by one. [Character classification {What is a letter? Its upper-case equivalent?}].)

4. If you need fully fuzzy searches that do not contain any Chinese characters, use the GIN index of the pg_trgm plug-in.

5. If you need regex searches, use the GIN index of the pg_trgm plug-in.

6. If the number of input characters is less than 3, use GIN expression indexes along with arrays to accelerate the fuzzy search.

VII. Performance

100 million records are provided, and each record contains 15 random Chinese characters. Follow the steps listed below to test the performance of the prefix fuzzy search, suffix fuzzy search, and fully fuzzy search.

Step 1. Generate test data.

Step 2. Create indexes as shown below.

Refer the following table and index sizes.

Step 3. Test the fuzzy search performance.

Step 3.1 Performance of prefix fuzzy search is as follows:

  • Response time: 9 ms
  • 4701 rows are returned.

Step 3.2 Performance of suffix fuzzy search is as follows:

  • Response time: 0.25 ms
  • 2 rows are returned.

Step 3.3 Performance of fully fuzzy search is as follows:

  • Response time: 0.2 ms
  • 1 row is returned.

Summary

For fuzzy searches based on three or more characters, pg_trgm is a good solution.

For fuzzy searches based on 1 or 2 characters, you can achieve good performance by using expression indexes.

It is recommended that the application determines the number of characters and selects the correct SQL syntax when performing the search.

lc_collate, lc_ctype

Check whether PostgreSQL is actually using the required locale. The LC_COLLATE and LC_CTYPE settings are determined when a database is created, and cannot be changed except by creating a new database. Other locale settings including LC_MESSAGES and LC_MONETARY are initially determined by the environment the server is started in, but can be changed on-the-fly. You can check the active locale settings using the SHOW command.

lc_ctype and lc_collate can only be specified when the database is created. Once specified, they cannot be modified. Refer this document to know more about the role of lc_ctype and lc_collate.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store