Performance Optimization of Fuzzy Queries for Chinese Characters Using PostgreSQL trgm

10 min readDec 17, 2019

By Digoal

Background

Fuzzy prefix or suffix queries and regexp matching are common in text search and may hamper the overall search performance.

Like other databases, PostgreSQL also uses B-tree indexes for speeding up fuzzy prefix or suffix queries. You can also use an index on a reverse function for speeding up fuzzy suffix queries. In addition to full-text search, PostgreSQL also provides trgm, which is not available in other databases.

The trgm module can be used for fuzzy prefix/suffix queries and regexp matching. It is a powerful plugin that significantly improves text search performance. For instance, trgm brings a 500X performance improvement in a search scenario that involves around one million data records.

ASCII Character Fuzzy Query/Regexp Matching: An Example

The following example shows how to generate one million data records and test the fuzzy query performance.

create extension pg_trgm;    
    
postgres=# create table tbl (id int, info text);    
CREATE TABLE    
postgres=# insert into tbl select generate_series(1,1000000), md5(random()::text);    
INSERT 0 1000000    
postgres=# create index idx_tbl_1 on tbl using gin(info gin_trgm_ops);    
CREATE INDEX    
    
postgres=# select * from tbl limit 10;    
 id |               info                   
----+----------------------------------    
  1 | dc369f84738f7fa4dc38c364cef817d0    
  2 | 4912b0b16670c4f2390d44ae790b9809    
  3 | eb442b00bf3b5bc6863d004a2c8fa3bb    
  4 | 0b4b8a8ad0cdf2e6870afbb94813eba4    
  5 | 661e895ee982ec4d9f944b10adffb897    
  6 | 09c4e7476d4bdfc1ccbdfe92ba0fdbdf    
  7 | 8b6e442faed938d066dda5e552100277    
  8 | e5cdeca599d5068a8d3bb6ce9f370827    
  9 | ddbbfbeaa9199219b7c909fb395d9a69    
 10 | 96f254f64df1ec43bb0cb4801222c919    
(10 rows)    
    
postgres=# select * from tbl where info ~ '670c4f2';    
 id |               info                   
----+----------------------------------    
  2 | 4912b0b16670c4f2390d44ae790b9809    
(1 row)    
Time: 2.668 ms    
    
postgres=# explain analyze select * from tbl where info ~ '670c4f2';    
                                                     QUERY PLAN                                                          
---------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on tbl  (cost=28.27..138.43 rows=100 width=37) (actual time=1.957..1.958 rows=1 loops=1)    
   Recheck Cond: (info ~ '670c4f2'::text)    
   Heap Blocks: exact=1    
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..28.25 rows=100 width=0) (actual time=1.939..1.939 rows=1 loops=1)    
         Index Cond: (info ~ '670c4f2'::text)    
 Planning time: 0.342 ms    
 Execution time: 1.989 ms    
(7 rows)

If trgm is not used to optimize the query, the preceding example query takes 1,657 milliseconds.

postgres=# set enable_bitmapscan=off;    
SET    
Time: 0.272 ms    
postgres=# select * from tbl where info ~ 'e770044a';    
 id |               info                   
----+----------------------------------    
  6 | 776c3cdf5fa818a324ef3e770044a488    
(1 row)    
Time: 1657.231 ms

Pg_trgm significantly improves the performance of ASCII character query.

1. Support for Chinese Characters (For Versions Prior to 9.3)

pg_trgm supports wchar for all the versions starting from PostgreSQL 9.3. In case you are using a version earlier than 9.3, convert the context to bytea as the query performance before the conversion to bytea is not high.

postgres=# explain analyze select * from tbl where info ~ '中国';    
                                                       QUERY PLAN                                                           
------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on tbl  (cost=149.62..151.82 rows=2 width=37) (actual time=8.624..8.624 rows=0 loops=1)    
   Recheck Cond: (info ~ '中国'::text)    
 Rows Removed by Index Recheck: 10103    
   Heap Blocks: exact=156    
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..149.61 rows=2 width=0) (actual time=1.167..1.167 rows=10103 loops=1)    
         Index Cond: (info ~ '中国'::text)    
 Planning time: 0.244 ms     Execution time: 8.657 ms    
(8 rows)    
Time: 9.388 ms

The query plan shows that although an index is used for Chinese characters, tokens are not properly used. Instead, the recheck index plays a big role.

Here, the performance is even lower than using a full-table scan.

postgres=# set enable_bitmapscan=off;    
SET    
postgres=# explain analyze select * from tbl where info ~ '中国';    
                                           QUERY PLAN                                               
------------------------------------------------------------------------------------------------    
 Seq Scan on tbl  (cost=0.00..399.75 rows=2 width=37) (actual time=6.899..6.899 rows=0 loops=1)    
   Filter: (info ~ '中国'::text)    
   Rows Removed by Filter: 10103    
 Planning time: 0.213 ms    
 Execution time: 6.921 ms    
(5 rows)    
Time: 7.593 ms

Convert Chinese Characters into the bytea Type and Support pg_trgm Indexes

You can use function-based indexes in PostgreSQL and the conversion to the bytea type (converting text into ASCII codes) to implement this feature.

Consider the following example.

postgres=# select text(textsend(info)) from tbl limit 10;    
                                                                                       text                                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------    
 \xe7abbde69b8ce7b5a4e8b197e5afa9e58c88e991a6e7b18ce5b495e8a79fe7ae8ee882bce7a283e7af9de8a086e7ac8de59e81e5a6bae9bcb6e6ba9fe981bbe4bda8e7928de98ab0e5a18de697b5e79fabe9b0a5e9b0a5    
 \xe5aa8ee69ab5e58996e892b0e89484e587b0e8bcbce69f80e79eb8e89390e7baa8e79f93e582b6e98f81e9a18ee9b48ee9ba8ce784a6e8b5a2e5a797e9a3b5e5a4aee986b1e9919de6b19ce9bdb9e6bbb6e8b5bde8b5bd    
 \xe7b4a4e5b2b3e7ac96e79481e78dbce5b28ae6b9b6e88dafe5aebce4bcbde8a3a3e4be98e78e93e5848ae4b888e5b0b5e5aeaee9aeb2e99982e59a98e6b0b2e583b3e9b799e893a5e5ba89e8949fe7868ee78cbde78cbd    
 \xe797a3e4b991e8baaee9ae88e69db5e78c99e9a8abe9bd80e7bd98e8b3bae89cb5e799bbe78d89e990a7e5b989e6a484e6a1a1e6939ce9b490e890b4e9a5abe6b392e58a9be5adaae9b895e89985e8a79ee8b889e8b889    
 \xe687a4e9b795e58094e9b0a6e6a58ee4bd80e6898ae6bdbee7828de788bde79897e8be83e59b93e7908ae9879be7b093e89eaae6a3bce792bee59e9ae8b5abe7a89fe9b6aae99bbae9a18fe6b3abe7b7aae89282e89282    
 \xe996b8e5a4b7e6b2b7e8a397e6a898e58a94e6a4a5e586b3e9b8b5e5ba98e99ba4e99c90e6be90e88d94e99dade89892e594abe59d98e5a7afe592a0e58c9be59590e8a299e7bb86e9abace7a5bee881bde793a7e793a7    
 \xe795aee7bba4e4bc86e7b29ae780b2e7bd9fe8a9bee8bf97e68486e5a4bde8a79ee6bf8be98cb8e8b6bfe4bb8ae88ba3e8ba98e6acb8e6aa94e59ab5e697bfe78b96e6859be7afb9e9bb85e799a7e798a3e6a982e6a982    
 \xe98987e7828be585ace9808ce5959be6b4a0e582ade59fbfe7b18ee792b9e8bd87e8849ce89d98e4b8b4e7af9ce6abb3e98a8ce89490e897bde59ea7e8a5a8e98a94e7848be59abae5bb9be890b6e58188e6acb8e6acb8    
 \xe7898de88880e89abfe99dbfe5bab9e5b387e8b3a7e8a0bfe9a4a7e5aa9be6a18ee68ca7e9b2b2e58b8de6a088e6a4abe5a481e58297e4bb90e5b780e786b4e6958de58bb4e78884e9ae98e9909ae8b19be984a8e984a8    
 \xe6b4a8e8b99ee6b789e8bfb9e9b69de9b0a6e9b7bde59fbae6a886e793a1e691ace9a185e5bba1e699a5e9bcace78598e9adaee9b199e59eb5e897b6e88f92e69caee8b9ade8beade4bdbae5b3b6e599b9e7bea1e7bea1    
(10 rows)    
Time: 0.457 ms

To create a GIN index on the bytea text, you need to create an immutable function. Make sure that the created index and query have consistent encoding with the client because you can hit the results only when query and storage have consistent encoding.

postgres=# create or replace function textsend_i (text) returns bytea as $$    
  select textsend($1);    
$$ language sql strict immutable;    
CREATE FUNCTION    
    
postgres=# drop index idx_tbl_1 ;    
DROP INDEX    
Time: 10.179 ms    
postgres=# create index idx_tbl_1 on tbl using gin(text(textsend_i(info)) gin_trgm_ops);    
CREATE INDEX

The GIN index on the bytea type significantly improves the query performance. The larger the data volume, the better the performance.

postgres=# set enable_bitmapscan=on;    
postgres=# explain analyze select * from tbl where text(textsend_i(info)) ~ ltrim(text(textsend_i('中国')), '\x');    
                                                      QUERY PLAN                                                          
----------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on tbl  (cost=369.28..504.93 rows=100 width=37) (actual time=0.099..0.099 rows=0 loops=1)    
   Recheck Cond: ((textsend_i(info))::text ~ 'e4b8ade59bbd'::text)    
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..369.25 rows=100 width=0) (actual time=0.097..0.097 rows=0 loops=1)    
         Index Cond: ((textsend_i(info))::text ~ 'e4b8ade59bbd'::text)    
 Planning time: 0.494 ms    
 Execution time: 0.128 ms    
(6 rows)    
    
postgres=# select * from tbl limit 10;    
 id |                            info                                
----+------------------------------------------------------------    
  1 | 竽曌絤豗審匈鑦籌崕觟箎肼碃篝蠆笍垁妺鼶溟遻佨璍銰塍旵矫鰥鰥    
  2 | 媎暵剖蒰蔄凰輼柀瞸蓐纨矓傶鏁顎鴎麌焦赢姗飵央醱鑝汜齹滶赽赽    
  3 | 紤岳笖甁獼岊湶药宼伽裣侘玓儊丈尵宮鮲陂嚘氲僳鷙蓥庉蔟熎猽猽    
  4 | 痣乑躮鮈杵猙騫齀罘賺蜵登獉鐧幉椄桡擜鴐萴饫泒力孪鸕虅觞踉踉    
  5 | 懤鷕倔鰦楎佀扊潾炍爽瘗较囓琊釛簓螪棼璾垚赫稟鶪雺顏泫緪蒂蒂    
  6 | 閸夷沷裗樘劔椥决鸵庘雤霐澐荔靭蘒唫坘姯咠匛啐袙细髬祾聽瓧瓧    
  7 | 畮绤伆粚瀲罟詾迗愆夽觞濋錸趿今苣躘欸檔嚵旿狖慛篹黅癧瘣橂橂    
  8 | 鉇炋公逌啛洠傭埿籎璹轇脜蝘临篜櫳銌蔐藽垧襨銔焋嚺廛萶偈欸欸    
  9 | 牍舀蚿靿庹峇賧蠿餧媛桎挧鲲勍栈椫夁傗仐巀熴敍勴爄鮘鐚豛鄨鄨    
 10 | 洨蹞淉迹鶝鰦鷽基樆瓡摬顅廡晥鼬煘魮鱙垵藶菒朮蹭辭佺島噹羡羡    
(10 rows)    
    
postgres=# explain analyze select * from tbl where text(textsend_i(info)) ~ ltrim(text(textsend_i('坘')), '\x');    
                                                      QUERY PLAN                                                          
----------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on tbl  (cost=149.88..574.79 rows=320 width=37) (actual time=0.063..0.063 rows=0 loops=1)    
   Recheck Cond: ((textsend_i(info))::text ~ 'e59d98'::text)    
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..149.80 rows=320 width=0) (actual time=0.061..0.061 rows=0 loops=1)    
         Index Cond: ((textsend_i(info))::text ~ 'e59d98'::text)    
 Planning time: 0.303 ms    
 Execution time: 0.087 ms    
(6 rows)    
    
postgres=# select * from tbl where text(textsend_i(info)) ~ ltrim(text(textsend_i('坘')), '\x');    
  id  |                            info                                
------+------------------------------------------------------------    
    6 | 閸夷沷裗樘劔椥决鸵庘雤霐澐荔靭蘒唫坘姯咠匛啐袙细髬祾聽瓧瓧    
  432 | 飒莭鮊鍥?笩妳琈笈慻儘轴轧坘碠郎蚿呙偓鍹脆鼺蹔谕蚱畨縫鱳鱳    
  934 | 咓僨復圼峷奁扉羰滵樞韴迬猰優鰸獤溅躐瓜抵権纀懶粯坘蚲纾鴁鴁    
 3135 | 倣稽蛯巭瘄皮蓈睫柨苧眱賴髄猍乱歖痐坘恋顎东趥谓鰪棩剔烱茟茟    
 3969 | 崴坘螏顓碴鵰邰欴苄蛨簰瘰膪菷栱镘衟齘觊诀忮繈憘痴峣撋梆澝澝    
 4688 | 围豁啖顫诬呅尥腥缾郸熛枵焐篯坘僇矟銘隨譼鎶舰肳礞婛轲蠟慕慕    
 6121 | 窳研稼旅唣疚褣鬾韨赑躽坘浒攁舑遬鳴滴抓嗠捒铗牜欘質丛姤騖騖    
 6904 | 飘稘輔鬄枠舶婬儁噈坘裎姖爙炃苖隽斓堯鈶摙蚼疁兗快鐕鎒墩譭譭    
 8854 | 叒鐲唬鞩泍糕懜坘戚靥鎿鋂炿尟汜阢甌鲖埁顔胳邉謾宱肦劰責戆戆    
 9104 | 鵬篱爯俌坘柉誵孀漴纞錀澁摫螭芄餜爹綅俆逨哒猈珢輿廄陲欗缷缷    
 9404 | 民坘謤齏隽紽峐荟頩胯頴傳蠂枯滦榦陠帡疃鈶遽艌瘧蒭嗍龞瓈嚍嚍    
 9727 | 夃坘慫逹壪泵偉鸶揺雠倴矸虠覾芽齏遬儂錞鐴焑劽疁擯蛛倞瑫菰菰    
(12 rows)

2. Support for Chinese Characters (For Version 9.3 or Later)

A prerequisite for pg_trgm to support Chinese characters is that both Collate and Ctype in the database cannot be C.

In the following databases where Collate and Ctype are C, pg_trgm does not support wchar (including Chinese characters).

postgres=# \l+  
    n case f you are using a version earlier than 9.3, you need to convert the context to bytea.

It is critical to note that the query performance is not high before the conversion to bytea.

postgres=# explain analyze select * from tbl where info ~ '中国';    
                                                       QUERY PLAN                                                           
------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on tbl  (cost=149.62..151.82 rows=2 width=37) (actual time=8.624..8.624 rows=0 loops=1)    
                                                                    List of databases  
        Name        |  Owner   | Encoding  |  Collate   |   Ctype    |   Access privileges   |  Size   | Tablespace |                Description                   
--------------------+----------+-----------+------------+------------+-----------------------+---------+------------+--------------------------------------------  
 contrib_regression | postgres | UTF8      | C          | C          |                       | 9313 kB | pg_default |   
 db                 | postgres | SQL_ASCII | C          | C          |                       | 7359 kB | pg_default |   
 db1                | postgres | EUC_CN    | C          | C          |                       | 7351 kB | pg_default |   
 postgres           | postgres | UTF8      | C          | C          |                       | 1686 MB | pg_default | default administrative connection database  
 template0          | postgres | UTF8      | C          | C          | =c/postgres          +| 7225 kB | pg_default | unmodifiable empty database  
                    |          |           |            |            | postgres=CTc/postgres |         |            |   
 template1          | postgres | UTF8      | C          | C          | =c/postgres          +| 7225 kB | pg_default | default template for new databases  
                    |          |           |            |            | postgres=CTc/postgres |         |            |   
 test               | postgres | UTF8      | en_US.UTF8 | en_US.UTF8 |                       | 7415 kB | pg_default |   
 test01             | postgres | UTF8      | C          | C          |                       | 1621 MB | pg_default |   
(8 rows)

Example 1: When Collate or Ctype is C-wchar is not supported.

postgres=# \c db1  
You are now connected to database "db1" as user "postgres".  
db1=# create extension pg_trgm;  
CREATE EXTENSION  
db1=# select show_trgm('你好');  
 show_trgm   
-----------  
 {}  
(1 row)

Example 2: When Collate or Ctype is Smaller or Larger than C-wchar is Supported.

db1=# \c test  
You are now connected to database "test" as user "postgres".  
test=# select show_trgm('你好');  
        show_trgm          
-------------------------  
 {0xcf7970,0xf98da8,IgR}  
(1 row)

You must specify Collate and Ctype while creating a database,

For example.

postgres=# create database test02 with template template0 lc_collate "zh_CN.UTF8" lc_ctype "zh_CN.UTF8" encoding 'UTF8';  
CREATE DATABASE  
  
postgres=# \l+ test02  
                                                List of databases  
  Name  |  Owner   | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace | Description   
--------+----------+----------+------------+------------+-------------------+---------+------------+-------------  
 test02 | postgres | UTF8     | zh_CN.UTF8 | zh_CN.UTF8 |                   | 7225 kB | pg_default |   
(1 row)

Accelerate Chinese Character Fuzzy Query

As mentioned in the preceding section, the database prerequisite is that Collate or Ctype must be smaller or larger than C. Consider the following examples for better understanding.

Example 1: GIN

postgres=# \c test02  
You are now connected to database "test02" as user "postgres".  
test02=# create extension pg_trgm;  
CREATE EXTENSION  
test02=# create table test(id int, info text);  
CREATE TABLE  
test02=# insert into test values (1,'你好，我是中国人');  
INSERT 0 1  
test02=# create index idx_test_1 on test using gin(info gin_trgm_ops);    
CREATE INDEX  
test02=# set enable_seqscan=off;  
SET  
test02=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '北京天安门';  
                                                    QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.test  (cost=5.20..6.51 rows=1 width=36) (actual time=0.075..0.075 rows=0 loops=1)  
   Output: id, info  
   Recheck Cond: (test.info ~ '北京天安门'::text)  -- 说明索引已过滤了  
   Buffers: shared hit=4  
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..5.20 rows=1 width=0) (actual time=0.070..0.070 rows=0 loops=1)  
         Index Cond: (test.info ~ '北京天安门'::text)  
         Buffers: shared hit=4  
 Planning time: 0.174 ms  
 Execution time: 0.107 ms  
(9 rows)  
  
test02=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '1';  
                                                     QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.test  (cost=13.01..14.32 rows=1 width=36) (actual time=0.052..0.052 rows=0 loops=1)  
   Output: id, info  
   Recheck Cond: (test.info ~ '1'::text)  
   Rows Removed by Index Recheck: 1  -- 命中索引（与TOKEN有关）, 通过recheck过滤成功  
   Heap Blocks: exact=1  
   Buffers: shared hit=4  
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..13.01 rows=1 width=0) (actual time=0.040..0.040 rows=1 loops=1)  
         Index Cond: (test.info ~ '1'::text)  
         Buffers: shared hit=3  
 Planning time: 0.157 ms  
 Execution time: 0.076 ms  
(11 rows)  
  
test02=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '你好';  
                                                     QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.test  (cost=13.00..14.31 rows=1 width=36) (actual time=0.052..0.052 rows=1 loops=1)  
   Output: id, info  
   Recheck Cond: (test.info ~ '你好'::text)  -- 命中索引  
   Heap Blocks: exact=1  
   Buffers: shared hit=4  
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..13.00 rows=1 width=0) (actual time=0.040..0.040 rows=1 loops=1)  
         Index Cond: (test.info ~ '你好'::text)  
         Buffers: shared hit=3  
 Planning time: 0.156 ms  
 Execution time: 0.077 ms  
(10 rows)  
  
  
test02=# select * from test where info ~ '1';  
 id | info   
----+------  
(0 rows)  
  
test02=# select * from test where info ~ '你好';  
 id |       info         
----+------------------  
  1 | 你好，我是中国人  
(1 row)  
  
test02=# select * from test where info ~ '北京天安门';  
 id | info   
----+------  
(0 rows)

Example 2: GiST

test02=# create index idx_test_2 on test using gist(info gist_trgm_ops);    
CREATE INDEX  
test02=# drop index idx_test_1;  
DROP INDEX  
test02=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '你好';  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_test_2 on public.test  (cost=0.12..2.74 rows=1 width=36) (actual time=0.081..0.082 rows=1 loops=1)  
   Output: id, info  
   Index Cond: (test.info ~ '你好'::text)  
   Buffers: shared hit=2  
 Planning time: 0.134 ms  
 Execution time: 0.121 ms  
(6 rows)  
  
test02=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '1';  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_test_2 on public.test  (cost=0.12..2.74 rows=1 width=36) (actual time=0.079..0.079 rows=0 loops=1)  
   Output: id, info  
   Index Cond: (test.info ~ '1'::text)  
   Rows Removed by Index Recheck: 1  
   Buffers: shared hit=2  
 Planning time: 0.068 ms  
 Execution time: 0.107 ms  
(7 rows)  
  
test02=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '北京天安门';  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_test_2 on public.test  (cost=0.12..2.74 rows=1 width=36) (actual time=0.102..0.102 rows=0 loops=1)  
   Output: id, info  
   Index Cond: (test.info ~ '北京天安门'::text)  
   Buffers: shared hit=1  
 Planning time: 0.067 ms  
 Execution time: 0.130 ms  
(6 rows)

GiST or GIN: Which One is Better

If the filtering criteria return a large result set (for example, more than 10,000 rows) and you need to limit the returned result set, GiST is recommended. On the other hand, if the filtering criteria return a small result set, GIN is recommended.

3. Sort Output Results by Similarity for Inexact and Fuzzy Matching

Use a GiST index to sort results based on similarity. But, this may return inexact matching results. For example, PostgreSQL may be returned in one of the first rows because it is very similar to gerSQL. However, the user may not need it.

Consider the following example.

CREATE TABLE test_trgm (t text);  
  
CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);  
  
SELECT t, t <-> 'word' AS dist  
  FROM test_trgm  
  ORDER BY dist LIMIT 10;  
  
或者  
  
SELECT t  
  FROM test_trgm  
  ORDER BY t <-> 'word' desc LIMIT 10;  
  
如果列包含中文，同样可以使用前面的immutable函数索引代替之

Note

Since pg_trgm considers three consecutive characters for a token, the performance might not be good if the word you want to query consists of only one or two characters. At least one character is required for prefix matching and at least two for suffix matching, for example, ‘^a’ and ‘ab$’. This ensures that a token is matched at least and inverted indexes are used for optimization.

It is recommended that you use pg_trgm to query a word consisting of at least three characters.

Now, you may ask how to implement a fuzzy query for a string consisting of one or two characters?

Split the string into an array of one or two consecutive characters, create a GIN index on the array and search for array @> {target word}.

Other Considerations

Many recheck operations may occur when a string is too short (for example, less than 3 characters) or a word has high-frequency (many instances of the word in the text). This happens because too many tokens are hit during the first filtering phase and all blocks after the combination operation meet the query criteria.

The solution or evaluation approach to the aforementioned problem requires to adjust the input parameters if the evaluation finds too many rows.

CREATE FUNCTION count_estimate(query text) RETURNS INTEGER AS  
$func$  
DECLARE  
    rec   record;  
    ROWS  INTEGER;  
BEGIN  
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP  
        ROWS := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');  
        EXIT WHEN ROWS IS NOT NULL;  
    END LOOP;  
  
    RETURN ROWS;  
END  
$func$ LANGUAGE plpgsql;

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Performance Optimization of Fuzzy Queries for Chinese Characters Using PostgreSQL trgm

Background

ASCII Character Fuzzy Query/Regexp Matching: An Example

1. Support for Chinese Characters (For Versions Prior to 9.3)

Convert Chinese Characters into the bytea Type and Support pg_trgm Indexes

2. Support for Chinese Characters (For Version 9.3 or Later)

Accelerate Chinese Character Fuzzy Query

GiST or GIN: Which One is Better

3. Sort Output Results by Similarity for Inexact and Fuzzy Matching

Note

Other Considerations

Original Source

Performance Optimization of Fuzzy Queries for Chinese Characters Using PostgreSQL trgm

This article describes how trgm, a powerful plugin significantly improves text search performance for fuzzy prefix and…

Written by Alibaba Cloud

No responses yet