PostgreSQL Fuzzy Searches vs. Regex Matches: A Performance Comparison

3 min readDec 17, 2019

By Digoal

Background

PostgreSQL uses pg_trgm plug-in to support regular expressions and LIKE for fully fuzzy searches (prefix/suffix-free fuzzy searches). However, to support searches for Chinese characters, you need to ensure that lc_ctype <> C.

Although in terms of semantics, the meanings of the following two searches are the same.

select * from test where col like '%xxxxxx%';   
   
select * from test where col ~ 'xxxxxx';

However, different processing logic is used in the internal database processing, which respectively correspond to the following code.

src/backend/utils/adt/like.c
src/backend/utils/adt/regexp.c

This may cause some performance differences, but the performance of LIKE will be much better.

Optimization Suggestions for Fuzzy Searches and Regex Searches

The processing logic of regular expressions is more complex. Therefore, it is recommended to use LIKE when regular expressions are not necessitated, else use regular expressions.

The following two tests illustrate the performance comparison.

create or replace function gen_hanzi(int) returns text as $$     
declare     
  res text;     
begin     
  if $1 >=1 then     
    select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);     
    return res;     
  end if;     
  return null;     
end;     
$$ language plpgsql strict;     
   
   
postgres=# create table test(id int, info text);   
CREATE TABLE   
postgres=# insert into test select generate_series(1,100000), gen_hanzi(100);   
INSERT 0 100000   
postgres=# create index idx_test_1 on test using gin (info gin_trgm_ops);   
CREATE INDEX

Although an index is used, the syntax of regex searches is currently not effective for wchar characters, and the entire GIN tree is scanned.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '婐绷乂畳';   
                                                             QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=45261409.28..45261421.30 rows=10 width=36) (actual time=583.810..816.503 rows=1 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~ '婐绷乂畳'::text)   
   Rows Removed by Index Recheck: 99999   
   Heap Blocks: exact=4167   
   Buffers: shared hit=59783   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..45261409.28 rows=10 width=0) (actual time=583.237..583.237 rows=100000 loops=1)   
         Index Cond: (test.info ~ '婐绷乂畳'::text)   
         Buffers: shared hit=55616   
 Planning time: 0.150 ms   
 Execution time: 816.545 ms   
(11 rows)

It is important to note that the syntax of regex searches works well for ASCII characters.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '123';   
                                                      QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.046..0.046 rows=0 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~ '123'::text)   
   Buffers: shared hit=4   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..38.40 rows=4000 width=0) (actual time=0.043..0.043 rows=0 loops=1)   
         Index Cond: (test.info ~ '123'::text)   
         Buffers: shared hit=4   
 Planning time: 0.146 ms   
 Execution time: 0.072 ms   
(9 rows)

On the other hand, the syntax of LIKE works well for both ASCII and wchar characters.

-- wchar   
   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like '%婐绷乂畳%';   
                                                     QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=13.28..25.30 rows=10 width=36) (actual time=0.042..0.042 rows=1 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~~ '%婐绷乂畳%'::text)   
   Heap Blocks: exact=1   
   Buffers: shared hit=8   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..13.27 rows=10 width=0) (actual time=0.027..0.027 rows=1 loops=1)   
         Index Cond: (test.info ~~ '%婐绷乂畳%'::text)   
         Buffers: shared hit=7   
 Planning time: 0.110 ms   
 Execution time: 0.108 ms   
(10 rows)   
   
-- ascii   
   
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~~ '%123%';   
                                                      QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------   
 Bitmap Heap Scan on public.test  (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.018..0.018 rows=0 loops=1)   
   Output: id, info   
   Recheck Cond: (test.info ~~ '%123%'::text)   
   Buffers: shared hit=4   
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..38.40 rows=4000 width=0) (actual time=0.015..0.015 rows=0 loops=1)   
         Index Cond: (test.info ~~ '%123%'::text)   
         Buffers: shared hit=4   
 Planning time: 0.091 ms   
 Execution time: 0.046 ms   
(9 rows)

The preceding two tests show that the operators used for LIKE and regular expressions are different.

List of operators   
   Schema   | Name | Left arg type | Right arg type | Result type |  Function   |       Description          
------------+------+---------------+----------------+-------------+-------------+-------------------------   
 pg_catalog | ~~   | text          | text           | boolean     | textlike    | matches LIKE expression   
 pg_catalog | ~    | text          | text           | boolean     | textregexeq | matches regular expression, case-sensitive

The operators correspond to textlike and textregexeq, respectively and the code for both is as follows.

src/backend/utils/adt/like.c
src/backend/utils/adt/regexp.c

To achieve the optimal search effect for fully fuzzy searches in the current scenario, it is recommended to use LIKE expression or ~~ expression. Do not use the syntax of regular expressions.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

PostgreSQL Fuzzy Searches vs. Regex Matches: A Performance Comparison

Background

Optimization Suggestions for Fuzzy Searches and Regex Searches

Original Source

PostgreSQL Fuzzy Searches vs. Regex Matches: A Performance Comparison

This article lays down the performance differences between PostgreSQL fuzzy searches and regex matches, and further…

Written by Alibaba Cloud

No responses yet