PostgreSQL Fuzzy Searches vs. Regex Matches: A Performance Comparison
By Digoal
Background
PostgreSQL uses pg_trgm plug-in to support regular expressions and LIKE for fully fuzzy searches (prefix/suffix-free fuzzy searches). However, to support searches for Chinese characters, you need to ensure that lc_ctype <> C.
Although in terms of semantics, the meanings of the following two searches are the same.
select * from test where col like '%xxxxxx%';
select * from test where col ~ 'xxxxxx';
However, different processing logic is used in the internal database processing, which respectively correspond to the following code.
- src/backend/utils/adt/like.c
- src/backend/utils/adt/regexp.c
This may cause some performance differences, but the performance of LIKE will be much better.
Optimization Suggestions for Fuzzy Searches and Regex Searches
The processing logic of regular expressions is more complex. Therefore, it is recommended to use LIKE when regular expressions are not necessitated, else use regular expressions.
The following two tests illustrate the performance comparison.
create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
postgres=# create table test(id int, info text);
CREATE TABLE
postgres=# insert into test select generate_series(1,100000), gen_hanzi(100);
INSERT 0 100000
postgres=# create index idx_test_1 on test using gin (info gin_trgm_ops);
CREATE INDEX
Although an index is used, the syntax of regex searches is currently not effective for wchar characters, and the entire GIN tree is scanned.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '婐绷乂畳';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=45261409.28..45261421.30 rows=10 width=36) (actual time=583.810..816.503 rows=1 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '婐绷乂畳'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=4167
Buffers: shared hit=59783
-> Bitmap Index Scan on idx_test_1 (cost=0.00..45261409.28 rows=10 width=0) (actual time=583.237..583.237 rows=100000 loops=1)
Index Cond: (test.info ~ '婐绷乂畳'::text)
Buffers: shared hit=55616
Planning time: 0.150 ms
Execution time: 816.545 ms
(11 rows)
It is important to note that the syntax of regex searches works well for ASCII characters.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '123';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.046..0.046 rows=0 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test_1 (cost=0.00..38.40 rows=4000 width=0) (actual time=0.043..0.043 rows=0 loops=1)
Index Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
Planning time: 0.146 ms
Execution time: 0.072 ms
(9 rows)
On the other hand, the syntax of LIKE works well for both ASCII and wchar characters.
-- wchar
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like '%婐绷乂畳%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=13.28..25.30 rows=10 width=36) (actual time=0.042..0.042 rows=1 loops=1)
Output: id, info
Recheck Cond: (test.info ~~ '%婐绷乂畳%'::text)
Heap Blocks: exact=1
Buffers: shared hit=8
-> Bitmap Index Scan on idx_test_1 (cost=0.00..13.27 rows=10 width=0) (actual time=0.027..0.027 rows=1 loops=1)
Index Cond: (test.info ~~ '%婐绷乂畳%'::text)
Buffers: shared hit=7
Planning time: 0.110 ms
Execution time: 0.108 ms
(10 rows)
-- ascii
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~~ '%123%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.018..0.018 rows=0 loops=1)
Output: id, info
Recheck Cond: (test.info ~~ '%123%'::text)
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test_1 (cost=0.00..38.40 rows=4000 width=0) (actual time=0.015..0.015 rows=0 loops=1)
Index Cond: (test.info ~~ '%123%'::text)
Buffers: shared hit=4
Planning time: 0.091 ms
Execution time: 0.046 ms
(9 rows)
The preceding two tests show that the operators used for LIKE and regular expressions are different.
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Function | Description
------------+------+---------------+----------------+-------------+-------------+-------------------------
pg_catalog | ~~ | text | text | boolean | textlike | matches LIKE expression
pg_catalog | ~ | text | text | boolean | textregexeq | matches regular expression, case-sensitive
The operators correspond to textlike and textregexeq, respectively and the code for both is as follows.
- src/backend/utils/adt/like.c
- src/backend/utils/adt/regexp.c
To achieve the optimal search effect for fully fuzzy searches in the current scenario, it is recommended to use LIKE expression or ~~ expression. Do not use the syntax of regular expressions.
Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.