Performance Optimization of Fuzzy Queries for Chinese Characters Using PostgreSQL trgm

By Digoal

Background

Fuzzy prefix or suffix queries and regexp matching are common in text search and may hamper the overall search performance.

Like other databases, PostgreSQL also uses B-tree indexes for speeding up fuzzy prefix or suffix queries. You can also use an index on a reverse function for speeding up fuzzy suffix queries. In addition to full-text search, PostgreSQL also provides trgm, which is not available in other databases.

The trgm module can be used for fuzzy prefix/suffix queries and regexp matching. It is a powerful plugin that significantly improves text search performance. For instance, trgm brings a 500X performance improvement in a search scenario that involves around one million data records.

ASCII Character Fuzzy Query/Regexp Matching: An Example

The following example shows how to generate one million data records and test the fuzzy query performance.

If trgm is not used to optimize the query, the preceding example query takes 1,657 milliseconds.

Pg_trgm significantly improves the performance of ASCII character query.

1. Support for Chinese Characters (For Versions Prior to 9.3)

pg_trgm supports wchar for all the versions starting from PostgreSQL 9.3. In case you are using a version earlier than 9.3, convert the context to bytea as the query performance before the conversion to bytea is not high.

The query plan shows that although an index is used for Chinese characters, tokens are not properly used. Instead, the recheck index plays a big role.

Here, the performance is even lower than using a full-table scan.

Convert Chinese Characters into the bytea Type and Support pg_trgm Indexes

You can use function-based indexes in PostgreSQL and the conversion to the bytea type (converting text into ASCII codes) to implement this feature.

Consider the following example.

To create a GIN index on the bytea text, you need to create an immutable function. Make sure that the created index and query have consistent encoding with the client because you can hit the results only when query and storage have consistent encoding.

The GIN index on the bytea type significantly improves the query performance. The larger the data volume, the better the performance.

2. Support for Chinese Characters (For Version 9.3 or Later)

A prerequisite for pg_trgm to support Chinese characters is that both Collate and Ctype in the database cannot be C.

In the following databases where Collate and Ctype are C, pg_trgm does not support wchar (including Chinese characters).

It is critical to note that the query performance is not high before the conversion to bytea.

Example 1: When Collate or Ctype is C-wchar is not supported.

Example 2: When Collate or Ctype is Smaller or Larger than C-wchar is Supported.

You must specify Collate and Ctype while creating a database,

For example.

Accelerate Chinese Character Fuzzy Query

As mentioned in the preceding section, the database prerequisite is that Collate or Ctype must be smaller or larger than C. Consider the following examples for better understanding.

Example 1: GIN

Example 2: GiST

GiST or GIN: Which One is Better

If the filtering criteria return a large result set (for example, more than 10,000 rows) and you need to limit the returned result set, GiST is recommended. On the other hand, if the filtering criteria return a small result set, GIN is recommended.

3. Sort Output Results by Similarity for Inexact and Fuzzy Matching

Use a GiST index to sort results based on similarity. But, this may return inexact matching results. For example, PostgreSQL may be returned in one of the first rows because it is very similar to gerSQL. However, the user may not need it.

Consider the following example.

Note

Since pg_trgm considers three consecutive characters for a token, the performance might not be good if the word you want to query consists of only one or two characters. At least one character is required for prefix matching and at least two for suffix matching, for example, ‘^a’ and ‘ab$’. This ensures that a token is matched at least and inverted indexes are used for optimization.

It is recommended that you use pg_trgm to query a word consisting of at least three characters.

Now, you may ask how to implement a fuzzy query for a string consisting of one or two characters?

Split the string into an array of one or two consecutive characters, create a GIN index on the array and search for array @> {target word}.

Other Considerations

Many recheck operations may occur when a string is too short (for example, less than 3 characters) or a word has high-frequency (many instances of the word in the text). This happens because too many tokens are hit during the first filtering phase and all blocks after the combination operation meet the query criteria.

The solution or evaluation approach to the aforementioned problem requires to adjust the input parameters if the evaluation finds too many rows.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store