Efficiently Implementing Full-table and Full-field Fuzzy Search in Milliseconds with PostgreSQL

Image for post
Image for post

By Digoal

Background

In some applications, you may need to search all the fields in a table such as in PostgreSQL. Some fields may require precise search, while others may require a fuzzy search or full-text search. For example, the selection of drop-down boxes on various front-end pages. This might be a headache for application developers as writing SQL statements is a hassle.

The previous article describes the full-text search scenario. But, what if the user wants to perform a fuzzy search? You can refer to the following series of articles I have previously written about the fuzzy search scenarios:

Now, the question is how to perform fuzzy searches on all the fields in an entire table? The answer is pg_trgm. It is a key technology to efficiently implement full-table and full-field fuzzy search.

Implementation Example of Full-table and Full-field Fuzzy Search

A page is designed at the front-end to allow users to perform fuzzy searches, but its corresponding table has several fields, and the search range is usually all fields.

From the user’s perspective, the experience is good. However, it is a headache for programmers as they do not know which field(s) the user wants to search for. So, the burning question is how can we achieve efficient matching?

To begin with, create a test table, and generate test data.

postgres=# create table t(phonenum text, info text, c1 int, c2 text, c3 text, c4 timestamp);    
CREATE TABLE
postgres=# insert into t values ('13888888888','i am digoal, a postgresqler',123,'china','中华人民共和国,阿里巴巴,阿',now());
INSERT 0 1
postgres=# select * from t;
phonenum | info | c1 | c2 | c3 | c4
-------------+-----------------------------+-----+-------+------------------------------+----------------------------
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2016-04-19 11:15:55.208658
(1 row)

It is critical to note that if the field required to be searched contains Chinese characters or other multi-bytes characters, the database with Collate and Ctype=C cannot be used.

Fortunately, Alibaba Cloud RDS PostgreSQL is not “C” by default, which is great!

如果不是,你可以这么指定collate 和 ctype  

postgres=# create database test with template template0 lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8';

Next, create a function index that supports fuzzy searches.

create extension pg_trgm;  

create or replace function record_to_text(anyelement) returns text as $$
select $1::text;
$$ language sql strict immutable;

test=# create index idx_t_1 on t using gin (record_to_text(t) gin_trgm_ops) ;
CREATE INDEX

当需要使用分页,或者结果集很大时,建议使用gist
test=# create index idx_t_2 on t using gist (record_to_text(t) gist_trgm_ops) ;
CREATE INDEX

Now, implement a search test.

test=# explain select * from t where record_to_text(t) ~ 'digoal';  
QUERY PLAN
-------------------------------------------------------------------
Index Scan using idx_t_2 on t (cost=0.38..8.39 rows=1 width=140)
Index Cond: (record_to_text(t.*) ~ 'digoal'::text)
(2 rows)

Also, test the search performance as shown below.

先插一堆数据进去  
postgres=# insert into t select * from t;
INSERT 0 4194304
test=# select count(*) from t;
count
---------
4194304
(1 row)

然后插几条不一样的

insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌 德哥 刘德华 彭德怀',now());
insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌 德哥 刘德华 彭德怀',now());

vacuum analyze t;

High search speed.

test=# explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ 'dege';  
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using idx_t_2 on public.t (cost=0.41..2.43 rows=1 width=101) (actual time=0.236..0.254 rows=2 loops=1)
Output: phonenum, info, c1, c2, c3, c4
Index Cond: (record_to_text(t.*) ~ 'dege'::text)
Buffers: shared hit=5
Planning time: 0.349 ms
Execution time: 0.301 ms
(6 rows)

Also, run a test to search for rows containing Andy Lau (GIN index is recommended because few rows are available).

test=# /*+ BitmapScan(t idx_t_1) */ explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ '刘德华' limit 10;  
LOG: available indexes for BitmapScan(t): idx_t_1
LOG: pg_hint_plan:
used hint:
BitmapScan(t idx_t_1)
not used hint:
duplication hint:
error hint:

LOG: pg_hint_plan:
used hint:
not used hint:
BitmapScan(t idx_t_1)
duplication hint:
error hint:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Limit (cost=441.00..442.26 rows=1 width=101) (actual time=0.239..0.255 rows=2 loops=1)
Output: phonenum, info, c1, c2, c3, c4
Buffers: shared hit=4
-> Bitmap Heap Scan on public.t (cost=441.00..442.26 rows=1 width=101) (actual time=0.238..0.252 rows=2 loops=1)
Output: phonenum, info, c1, c2, c3, c4
Recheck Cond: (record_to_text(t.*) ~ '刘德华'::text)
Heap Blocks: exact=1
Buffers: shared hit=4
-> Bitmap Index Scan on idx_t_1 (cost=0.00..441.00 rows=1 width=0) (actual time=0.086..0.086 rows=2 loops=1)
Index Cond: (record_to_text(t.*) ~ '刘德华'::text)
Buffers: shared hit=3
Planning time: 0.494 ms
Execution time: 0.313 ms
(13 rows)


test=# /*+ BitmapScan(t idx_t_1) */ select * from t where record_to_text(t) ~ '刘德华' limit 10;
LOG: available indexes for BitmapScan(t): idx_t_1
LOG: pg_hint_plan:
used hint:
BitmapScan(t idx_t_1)
not used hint:
duplication hint:
error hint:

LOG: pg_hint_plan:
used hint:
not used hint:
BitmapScan(t idx_t_1)
duplication hint:
error hint:

phonenum | info | c1 | c2 | c3 | c4
-------------+---------------------------+-----+-------+-------------------------+----------------------------
13888889999 | i am dege, a postgresqler | 123 | china | 德歌 德哥 刘德华 彭德怀 | 2017-01-06 17:04:42.19215
13888889999 | i am dege, a postgresqler | 123 | china | 德歌 德哥 刘德华 彭德怀 | 2017-01-06 17:04:42.514895
(2 rows)

Time: 1.225 ms

Statement Timeout

Generally, once the index hits results, the response time varies from a fraction of a millisecond to tens of milliseconds based on the number of result sets returned.

However, in some cases, when the amount of information that a user inputs are too small, a slowdown may occur. For example, if a user inputs 2 characters, it generates numerous matched tokens resulting in a slowdown.

You can mitigate this challenge by using GiST. The application layer provides protection. For example, if the response time exceeds 1 second, the statement timeout will be reported.

postgres=# set statement_timeout = '1s';  
SET

or

test=# /*+ Set(statement_timeout 1s) */ select * from t where record_to_text(t) ~ 'd' limit 10;
LOG: pg_hint_plan:
used hint:
Set(statement_timeout 1s)
not used hint:
duplication hint:
error hint:

LOG: pg_hint_plan:
used hint:
Set(statement_timeout 1s)
not used hint:
duplication hint:
error hint:

phonenum | info | c1 | c2 | c3 | c4
-------------+-----------------------------+-----+-------+------------------------------+----------------------------
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国,阿里巴巴,阿 | 2017-01-06 16:51:16.840941
(10 rows)

Use of Hint

The following are a few simple rules to keep in mind while running queries.

  • When result sets are returned with cursors: Use GiST

In all other cases, use GIN.

With the above-listed rules, you can force the use of a specific index through Hint.

Other Optimization

You can also make some optimizations at the service level. For example, you can use the full-text search first, and in case, no matches are found then use the fuzzy search.

Read-only Instances

According to the previous tests, the response of a search should be within 1 ms.

For a 32-core computer, the QPS that this fuzzy search can achieve is estimated to be around 80,000. If you find that a single node no longer meets the concurrency of the search even when it is optimized, you can build a read-only instance. It is simple to build a read-only instance. For further information, please visit this page

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store