PostgreSQL Responds to Fuzzy or Regular Expression-based Searches within Seconds

Alibaba Cloud
Dec 17, 2019

By Digoal

Background

Regex matching and fuzzy matching are usually considered the specialty of search engines. However, a PostgreSQL database can also implement these functions with good performance. Moreover, coupled with distributed solutions (such as plproxy, pg_shard, fdw shard, pg-xc, pg-xl, greenplum), PostgreSQL can efficiently perform regex and fuzzy matches over more than ten billion data records without giving up the inherent capabilities of a database.

The Internet of Things (IoT) generates a large amount of data, which includes not only numeric data but also string data, such as bar codes, license plates, mobile phone numbers, emails, and names. So, what is an efficient way to perform fuzzy searches, or even regex matches, on a large amount of sensor data?

There are several such scenarios. For example, a batch of drugs on the market may turn out to be problematic; a regex search on drug barcodes can then be used to trace where the matching drugs have been distributed.

Another example is searching for clues during an investigation. The clues may include incomplete phone numbers, email addresses, license plates, IP addresses, QQ numbers, or WeChat numbers provided by users. Based on the available information, layered with the time dimension, fuzzy matching and correlation can help track down the culprit.

Fuzzy Search Scenarios and Optimization Methods

To begin with, let’s classify the application scenarios and look at the optimization methods available with existing technology.

1. Prefix Fuzzy Search: For example, LIKE ‘ABC%’, which can also be written as ~ ‘^ABC’ in PG.

2. Suffix Fuzzy Search: For example, LIKE ‘%ABC’, which can also be written as ~ ‘ABC$’ in PG.

3. Fully Fuzzy Search: For example, LIKE ‘%AB_C%’, which can also be written as ~ ‘AB.C’ in PG.

4. Regex Search: For example, ~ ‘[d]+def1.?[ a|b|0|8]{1,3}’

For all four scenarios, you can optimize with a GIN index and pg_trgm, or by splitting the column into multiple indexed columns and combining them with “bit and”/“bit or” (the latter is only suitable for short, fixed-length strings, such as char(8)).

The pg_trgm plug-in has supported index-based fuzzy searches since PostgreSQL 9.1 and index-based regex searches since PostgreSQL 9.3. This greatly improves PostgreSQL’s capabilities in scenarios such as criminal investigation.

For the code, you can refer to this document.

The principle of the pg_trgm plug-in is to prepend 2 spaces and append 1 space to the string to form a new string, and then split the new string into tokens of every 3 adjacent characters (trigrams).
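For instance, the show_trgm() function provided by pg_trgm displays the trigrams a string is split into. The output below assumes a default build of the extension, which lowercases the input, so exact results may vary with other configurations:

postgres=# select show_trgm('ABC');
         show_trgm
----------------------------
 {"  a"," ab","abc","bc "}
(1 row)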

When a regular expression or fuzzy search is performed, candidate strings are first retrieved from the index by matching these trigrams, and the actual pattern is then applied as a filter.

Example of the GIN index

When a matching token is found in the B-tree, it points to a posting list, and the corresponding records are located through the CTIDs stored in that list. Because each string is split into many tokens, every inserted record updates multiple index entries, which is why GIN Fast Update is needed.

So, if you are wondering how the regex match works, refer to this document for more details.

In fact, it converts the regular expression into NFA format and then scans multiple tokens with “bit and|or” matching. However, if many “bit and|or” candidates are produced by the conversion, a large number of rechecks are required, which negatively impacts performance.

Let’s find out how to optimize the various fuzzy search scenarios with the help of examples in the following sections.

How to Optimize Different Fuzzy Search Scenarios: Examples

1. Prefix Fuzzy Search

Consider LIKE ‘ABC%’, which can also be written as ~ ‘^ABC’ in PG. To optimize this, you can use a B-tree index, or column splitting with multiple indexes plus “bit and”/“bit or” (only suitable for short, fixed-length strings, such as char(8)). The following example generates 10 million rows, each storing the first 8 characters of a random MD5 string.

postgres=# create table tb(info text);    
CREATE TABLE
postgres=# insert into tb select substring(md5(random()::text),1,8) from generate_series(1,10000000);
INSERT 0 10000000
postgres=# create index idx_tb on tb(info);
CREATE INDEX
postgres=# select * from tb limit 1;
info
----------
376821ab
(1 row)
postgres=# explain select * from tb where info ~ '^376821' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------
Limit (cost=0.43..0.52 rows=10 width=9)
-> Index Only Scan using idx_tb on tb (cost=0.43..8.46 rows=1000 width=9)
Index Cond: ((info >= '376821'::text) AND (info < '376822'::text))
Filter: (info ~ '^376821'::text)
(4 rows)
postgres=# select * from tb where info ~ '^376821' limit 10;
info
----------
376821ab
(1 row)
Time: 0.536 ms
postgres=# set enable_indexscan=off;
SET
Time: 1.344 ms
postgres=# set enable_bitmapscan=off;
SET
Time: 0.158 ms
postgres=# explain select * from tb where info ~ '^376821' limit 10;
QUERY PLAN
----------------------------------------------------------------
Limit (cost=0.00..1790.55 rows=10 width=9)
-> Seq Scan on tb (cost=0.00..179055.00 rows=1000 width=9)
Filter: (info ~ '^376821'::text)
(3 rows)
Time: 0.505 ms

For prefix fuzzy searches, it takes 5483 milliseconds if no index is used and 0.5 milliseconds if an index is used.

postgres=# select * from tb where info ~ '^376821' limit 10;    
info
----------
376821ab
(1 row)
Time: 5483.655 ms
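One caveat about the B-tree approach above: a plain B-tree index can only serve prefix matches when the column uses the C collation. If the database uses another collation, create the index with the text_pattern_ops operator class instead (the index name below is just an example):

postgres=# create index idx_tb_pattern on tb (info text_pattern_ops);
CREATE INDEX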

2. Suffix Fuzzy Search

Consider LIKE ‘%ABC’, which can also be written as ~ ‘ABC$’ in PG. To optimize this scenario, you can build a B-tree index on reverse(column), or use column splitting with multiple indexes plus “bit and”/“bit or” (only suitable for short, fixed-length strings, such as char(8)).

postgres=# create index idx_tb1 on tb(reverse(info));    
CREATE INDEX
postgres=# explain select * from tb where reverse(info) ~ '^ba128' limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------
Limit (cost=0.43..28.19 rows=10 width=9)
-> Index Scan using idx_tb1 on tb (cost=0.43..138778.43 rows=50000 width=9)
Index Cond: ((reverse(info) >= 'ba128'::text) AND (reverse(info) < 'ba129'::text))
Filter: (reverse(info) ~ '^ba128'::text)
(4 rows)

postgres=# select * from tb where reverse(info) ~ '^ba128' limit 10;
info
----------
220821ab
671821ab
305821ab
e65821ab
536821ab
376821ab
668821ab
4d8821ab
26c821ab
(9 rows)
Time: 0.506 ms

For fuzzy suffix searches, it only takes 0.5 milliseconds when an index is used.
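The same suffix search can also be written in LIKE form by reversing both the column and the pattern, which turns the suffix match into a prefix match on reverse(info) and can use the same idx_tb1 index (subject to the same collation caveat noted in the prefix example above):

-- equivalent LIKE form of the query above
select * from tb where reverse(info) like 'ba128%' limit 10;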

3. Fully Fuzzy Search

Consider LIKE ‘%AB_C%’, which can also be written as ~ ‘AB.C’ in PG. To optimize this scenario, you can use a GIN index with pg_trgm, or column splitting with multiple indexes plus “bit and”/“bit or” (only suitable for short, fixed-length strings, such as char(8)).

postgres=# create extension pg_trgm;
CREATE EXTENSION
postgres=# create index idx_tb_2 on tb using gin (info gin_trgm_ops);
CREATE INDEX
postgres=# explain select * from tb where info ~ '5821a';
QUERY PLAN
----------------------------------------------------------------------------
Bitmap Heap Scan on tb (cost=103.75..3677.71 rows=1000 width=9)
Recheck Cond: (info ~ '5821a'::text)
-> Bitmap Index Scan on idx_tb_2 (cost=0.00..103.50 rows=1000 width=0)
Index Cond: (info ~ '5821a'::text)
(4 rows)
Time: 0.647 ms

postgres=# select * from tb where info ~ '5821a';
info
----------
5821a8a3
945821af
45821a74
9fe5821a
5821a7e0
5821af2a
1075821a
e5821ac9
d265821a
45f5821a
df5821a4
de5821af
71c5821a
375821a3
fc5821af
5c5821ad
e65821ab
5821adde
c35821a6
5821a642
305821ab
5821a1c8
75821a5c
ce95821a
a65821ad
(25 rows)
Time: 3.808 ms

For fully fuzzy searches, it only takes 3.8 milliseconds when an index is used.

4. Regex Search

Consider ~ ‘[d]+def1.?[ a|b|0|8]{1,3}’. To optimize search performance in this case, you can use a GIN index with pg_trgm, or column splitting with multiple indexes plus “bit and”/“bit or” (only suitable for short, fixed-length strings, such as char(8)).

For regex searches, it only takes about 108 milliseconds when an index is used.

postgres=# select * from tb where info ~ 'e65[\d]{2}a[b]{1,2}8' limit 10;    
info
----------
4e6567ab
1e6530ab
e6500ab8
ae6583ab
e6564ab7
5e6532ab
e6526abf
e6560ab6
(8 rows)
Time: 108.577 ms

It is important to note that the time is mainly spent on the recheck, that is, on excluding false matches.

14,794 rows are retrieved from the index, and 14,793 of them are removed by the recheck. A lot of time is spent on wasted work, but it is still far better than a full-table scan.

postgres=# explain (verbose,analyze,buffers,costs,timing) select * from tb where info ~ 'e65[\d]{2}a[b]{1,2}8' limit 10;    
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=511.75..547.49 rows=10 width=9) (actual time=89.934..120.567 rows=1 loops=1)
Output: info
Buffers: shared hit=13054
-> Bitmap Heap Scan on public.tb (cost=511.75..4085.71 rows=1000 width=9) (actual time=89.930..120.562 rows=1 loops=1)
Output: info
Recheck Cond: (tb.info ~ 'e65[\d]{2}a[b]{1,2}8'::text)
Rows Removed by Index Recheck: 14793
Heap Blocks: exact=12929
Buffers: shared hit=13054
-> Bitmap Index Scan on idx_tb_2 (cost=0.00..511.50 rows=1000 width=0) (actual time=67.589..67.589 rows=14794 loops=1)
Index Cond: (tb.info ~ 'e65[\d]{2}a[b]{1,2}8'::text)
Buffers: shared hit=125
Planning time: 0.493 ms
Execution time: 120.618 ms
(14 rows)
Time: 124.693 ms

Performance Optimization

You need to consider performance when the GIN index is used, because the info field is split into many 3-character tokens and a large number of index entries are involved. If the insert concurrency is high, it is best to set gin_pending_list_limit to a large value to improve insert efficiency and to reduce the response-time increase that real-time index merging would otherwise cause.

With Fast Update enabled, the pending entries are automatically merged into the GIN index each time a VACUUM is performed on the table. Queries do not trigger the merge; instead, the unmerged pending list is scanned sequentially at query time.
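As a rough sketch (the value is only illustrative; the gin_pending_list_limit storage parameter exists since PostgreSQL 9.5, and gin_clean_pending_list() since 9.6), the pending-list size can be tuned per index and the pending entries can be merged on demand:

-- enlarge the pending list of the GIN index created earlier (value in kB)
alter index idx_tb_2 set (gin_pending_list_limit = 6553600);
-- a VACUUM on the table merges the pending entries into the main index
vacuum tb;
-- on PostgreSQL 9.6 and later, the pending list can also be flushed without a VACUUM
select gin_clean_pending_list('idx_tb_2');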

Performance of High Concurrency Stress Testing

create table tbl(id serial8, crt_time timestamp, sensorid int, sensorloc point, info text) with (autovacuum_enabled=on, autovacuum_vacuum_threshold=0.000001,autovacuum_vacuum_cost_delay=0);    
CREATE INDEX trgm_idx ON tbl USING GIN (info gin_trgm_ops) with (fastupdate='on', gin_pending_list_limit='6553600');
alter sequence tbl_id_seq cache 10000;

Modify the configuration to ensure that autovacuum merges the GIN pending list quickly and frequently.

vi $PGDATA/postgresql.conf    
autovacuum_naptime=30s
maintenance_work_mem=1GB
autovacuum_work_mem=1GB
autovacuum = on
autovacuum_max_workers = 3
log_autovacuum_min_duration = 0
autovacuum_vacuum_cost_delay=0

$ pg_ctl reload

Create a test function to generate random test data.

postgres=# create or replace function f() returns void as $$    
insert into tbl (crt_time,sensorid,info) values ( clock_timestamp(),trunc(random()*500000),substring(md5(random()::text),1,8) );
$$ language sql strict;
vi test.sql
select f();

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 10000

progress: 50.0 s, 52800.9 tps, lat 0.453 ms stddev 0.390
progress: 51.0 s, 52775.8 tps, lat 0.453 ms stddev 0.398
progress: 52.0 s, 53173.2 tps, lat 0.449 ms stddev 0.371
progress: 53.0 s, 53010.0 tps, lat 0.451 ms stddev 0.390
progress: 54.0 s, 53360.9 tps, lat 0.448 ms stddev 0.365
progress: 55.0 s, 53285.0 tps, lat 0.449 ms stddev 0.362
progress: 56.0 s, 53662.1 tps, lat 0.445 ms stddev 0.368
progress: 57.0 s, 53283.8 tps, lat 0.448 ms stddev 0.385
progress: 58.0 s, 53703.4 tps, lat 0.445 ms stddev 0.355
progress: 59.0 s, 53818.7 tps, lat 0.444 ms stddev 0.344
progress: 60.0 s, 53889.2 tps, lat 0.443 ms stddev 0.361
progress: 61.0 s, 53613.8 tps, lat 0.446 ms stddev 0.355
progress: 62.0 s, 53339.9 tps, lat 0.448 ms stddev 0.392
progress: 63.0 s, 54014.9 tps, lat 0.442 ms stddev 0.346
progress: 64.0 s, 53112.1 tps, lat 0.450 ms stddev 0.374
progress: 65.0 s, 53706.1 tps, lat 0.445 ms stddev 0.367
progress: 66.0 s, 53720.9 tps, lat 0.445 ms stddev 0.353
progress: 67.0 s, 52858.1 tps, lat 0.452 ms stddev 0.415
progress: 68.0 s, 53218.9 tps, lat 0.449 ms stddev 0.387
progress: 69.0 s, 53403.0 tps, lat 0.447 ms stddev 0.377
progress: 70.0 s, 53179.9 tps, lat 0.449 ms stddev 0.377
progress: 71.0 s, 53232.4 tps, lat 0.449 ms stddev 0.373
progress: 72.0 s, 53011.7 tps, lat 0.451 ms stddev 0.386
progress: 73.0 s, 52685.1 tps, lat 0.454 ms stddev 0.384
progress: 74.0 s, 52937.8 tps, lat 0.452 ms stddev 0.377

At roughly 53,000 inserts per second, you can store more than 4 billion data records in the database each day (53,000 × 86,400 ≈ 4.58 billion).

Next, let’s look at the column-splitting approach. It applies only to scenarios where the string length is fixed and small; if the string length is not fixed, this method does not work.

Although the split method is workable, the test results below are unsatisfactory, so it is more reliable to use pg_trgm instead.

postgres=# create table t_split(id int, crt_time timestamp, sensorid int, sensorloc point, info text, c1 char(1), c2 char(1), c3 char(1), c4 char(1), c5 char(1), c6 char(1), c7 char(1), c8 char(1));    
CREATE TABLE
Time: 2.123 ms

postgres=# insert into t_split(id,crt_time,sensorid,info,c1,c2,c3,c4,c5,c6,c7,c8) select id,ct,sen,info,substring(info,1,1),substring(info,2,1),substring(info,3,1),substring(info,4,1),substring(info,5,1),substring(info,6,1),substring(info,7,1),substring(info,8,1) from (select id, clock_timestamp() ct, trunc(random()*500000) sen, substring(md5(random()::text), 1, 8) info from generate_series(1,10000000) t(id)) t;
INSERT 0 10000000
Time: 81829.274 ms

postgres=# create index idx1 on t_split (c1);
postgres=# create index idx2 on t_split (c2);
postgres=# create index idx3 on t_split (c3);
postgres=# create index idx4 on t_split (c4);
postgres=# create index idx5 on t_split (c5);
postgres=# create index idx6 on t_split (c6);
postgres=# create index idx7 on t_split (c7);
postgres=# create index idx8 on t_split (c8);
postgres=# create index idx9 on t_split using gin (info gin_trgm_ops);

postgres=# select * from t_split limit 1;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
(1 row)

postgres=# select * from t_split where info ~ '^3[\d]?eed[\d]?79$' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
(1 row)
Time: 133.041 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^3[\d]?eed[\d]?79$' limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=575.75..612.78 rows=10 width=57) (actual time=92.406..129.838 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=13798
-> Bitmap Heap Scan on public.t_split (cost=575.75..4278.56 rows=1000 width=57) (actual time=92.403..129.833 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Recheck Cond: (t_split.info ~ '^3[\d]?eed[\d]?79$'::text)
Rows Removed by Index Recheck: 14690
Heap Blocks: exact=13669
Buffers: shared hit=13798
-> Bitmap Index Scan on idx9 (cost=0.00..575.50 rows=1000 width=0) (actual time=89.576..89.576 rows=14691 loops=1)
Index Cond: (t_split.info ~ '^3[\d]?eed[\d]?79$'::text)
Buffers: shared hit=129
Planning time: 0.385 ms
Execution time: 129.883 ms
(14 rows)

Time: 130.678 ms


postgres=# select * from t_split where c1='3' and c3='e' and c4='e' and c5='d' and c7='7' and c8='9' and c2 between '0' and '9' and c6 between '0' and '9' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
(1 row)

Time: 337.367 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where c1='3' and c3='e' and c4='e' and c5='d' and c7='7' and c8='9' and c2 between '0' and '9' and c6 between '0' and '9' limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=33582.31..41499.35 rows=1 width=57) (actual time=339.230..344.675 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=7581
-> Bitmap Heap Scan on public.t_split (cost=33582.31..41499.35 rows=1 width=57) (actual time=339.228..344.673 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Recheck Cond: ((t_split.c3 = 'e'::bpchar) AND (t_split.c8 = '9'::bpchar) AND (t_split.c5 = 'd'::bpchar))
Filter: ((t_split.c2 >= '0'::bpchar) AND (t_split.c2 <= '9'::bpchar) AND (t_split.c6 >= '0'::bpchar) AND (t_split.c6 <= '9'::bpchar) AND (t_split.c1 = '3'::bpchar) AND (t_split.c4 = 'e'::bpchar) AND (t_split.c7 = '7'::bpchar))
Rows Removed by Filter: 2480
Heap Blocks: exact=2450
Buffers: shared hit=7581
-> BitmapAnd (cost=33582.31..33582.31 rows=2224 width=0) (actual time=338.512..338.512 rows=0 loops=1)
Buffers: shared hit=5131
-> Bitmap Index Scan on idx3 (cost=0.00..11016.93 rows=596333 width=0) (actual time=104.418..104.418 rows=624930 loops=1)
Index Cond: (t_split.c3 = 'e'::bpchar)
Buffers: shared hit=1711
-> Bitmap Index Scan on idx8 (cost=0.00..11245.44 rows=608667 width=0) (actual time=100.185..100.185 rows=625739 loops=1)
Index Cond: (t_split.c8 = '9'::bpchar)
Buffers: shared hit=1712
-> Bitmap Index Scan on idx5 (cost=0.00..11319.44 rows=612667 width=0) (actual time=99.480..99.480 rows=624269 loops=1)
Index Cond: (t_split.c5 = 'd'::bpchar)
Buffers: shared hit=1708
Planning time: 0.262 ms
Execution time: 344.731 ms
(23 rows)

Time: 346.424 ms

postgres=# select * from t_split where info ~ '^33.+7.+9$' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
--------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
24412 | 2016-03-02 09:58:04.186359 | 251599 | | 33f07429 | 3 | 3 | f | 0 | 7 | 4 | 2 | 9
24989 | 2016-03-02 09:58:04.191112 | 214569 | | 334587d9 | 3 | 3 | 4 | 5 | 8 | 7 | d | 9
50100 | 2016-03-02 09:58:04.398499 | 409819 | | 33beb7b9 | 3 | 3 | b | e | b | 7 | b | 9
92623 | 2016-03-02 09:58:04.745372 | 280100 | | 3373e719 | 3 | 3 | 7 | 3 | e | 7 | 1 | 9
106054 | 2016-03-02 09:58:04.855627 | 155192 | | 33c575c9 | 3 | 3 | c | 5 | 7 | 5 | c | 9
107070 | 2016-03-02 09:58:04.863827 | 464325 | | 337dd729 | 3 | 3 | 7 | d | d | 7 | 2 | 9
135152 | 2016-03-02 09:58:05.088217 | 240500 | | 336271d9 | 3 | 3 | 6 | 2 | 7 | 1 | d | 9
156425 | 2016-03-02 09:58:05.25805 | 218202 | | 333e7289 | 3 | 3 | 3 | e | 7 | 2 | 8 | 9
170210 | 2016-03-02 09:58:05.368371 | 132530 | | 33a8d789 | 3 | 3 | a | 8 | d | 7 | 8 | 9
(10 rows)

Time: 20.431 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^33.+7.+9$' limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=43.75..80.78 rows=10 width=57) (actual time=19.573..21.212 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=566
-> Bitmap Heap Scan on public.t_split (cost=43.75..3746.56 rows=1000 width=57) (actual time=19.571..21.206 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Recheck Cond: (t_split.info ~ '^33.+7.+9$'::text)
Rows Removed by Index Recheck: 647
Heap Blocks: exact=552
Buffers: shared hit=566
-> Bitmap Index Scan on idx9 (cost=0.00..43.50 rows=1000 width=0) (actual time=11.712..11.712 rows=39436 loops=1)
Index Cond: (t_split.info ~ '^33.+7.+9$'::text)
Buffers: shared hit=14
Planning time: 0.301 ms
Execution time: 21.255 ms
(14 rows)

Time: 21.995 ms


postgres=# select * from t_split where c1='3' and c2='3' and c8='9' and (c4='7' or c5='7' or c6='7') limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
--------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
24412 | 2016-03-02 09:58:04.186359 | 251599 | | 33f07429 | 3 | 3 | f | 0 | 7 | 4 | 2 | 9
24989 | 2016-03-02 09:58:04.191112 | 214569 | | 334587d9 | 3 | 3 | 4 | 5 | 8 | 7 | d | 9
50100 | 2016-03-02 09:58:04.398499 | 409819 | | 33beb7b9 | 3 | 3 | b | e | b | 7 | b | 9
92623 | 2016-03-02 09:58:04.745372 | 280100 | | 3373e719 | 3 | 3 | 7 | 3 | e | 7 | 1 | 9
106054 | 2016-03-02 09:58:04.855627 | 155192 | | 33c575c9 | 3 | 3 | c | 5 | 7 | 5 | c | 9
107070 | 2016-03-02 09:58:04.863827 | 464325 | | 337dd729 | 3 | 3 | 7 | d | d | 7 | 2 | 9
135152 | 2016-03-02 09:58:05.088217 | 240500 | | 336271d9 | 3 | 3 | 6 | 2 | 7 | 1 | d | 9
156425 | 2016-03-02 09:58:05.25805 | 218202 | | 333e7289 | 3 | 3 | 3 | e | 7 | 2 | 8 | 9
170210 | 2016-03-02 09:58:05.368371 | 132530 | | 33a8d789 | 3 | 3 | a | 8 | d | 7 | 8 | 9
(10 rows)

Time: 37.739 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where c1='3' and c2='3' and c8='9' and (c4='7' or c5='7' or c6='7') limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..8135.78 rows=10 width=57) (actual time=0.017..35.532 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=1755
-> Seq Scan on public.t_split (cost=0.00..353093.00 rows=434 width=57) (actual time=0.015..35.526 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Filter: ((t_split.c1 = '3'::bpchar) AND (t_split.c2 = '3'::bpchar) AND (t_split.c8 = '9'::bpchar) AND ((t_split.c4 = '7'::bpchar) OR (t_split.c5 = '7'::bpchar) OR (t_split.c6 = '7'::bpchar)))
Rows Removed by Filter: 170200
Buffers: shared hit=1755
Planning time: 0.210 ms
Execution time: 35.572 ms
(10 rows)

Time: 36.260 ms

postgres=# select * from t_split where info ~ '^3.?[b-g]+ed[\d]+79' order by info <-> '^3.?[b-g]+ed[\d]+79' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
---------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
1308724 | 2016-03-02 09:58:14.590901 | 458822 | | 3fed9479 | 3 | f | e | d | 9 | 4 | 7 | 9
2866024 | 2016-03-02 09:58:27.20105 | 106467 | | 3fed2279 | 3 | f | e | d | 2 | 2 | 7 | 9
4826729 | 2016-03-02 09:58:42.907431 | 228023 | | 3ded9879 | 3 | d | e | d | 9 | 8 | 7 | 9
6113373 | 2016-03-02 09:58:53.211146 | 499702 | | 36fed479 | 3 | 6 | f | e | d | 4 | 7 | 9
1768237 | 2016-03-02 09:58:18.310069 | 345027 | | 30fed079 | 3 | 0 | f | e | d | 0 | 7 | 9
1472324 | 2016-03-02 09:58:15.913629 | 413283 | | 3eed5798 | 3 | e | e | d | 5 | 7 | 9 | 8
8319056 | 2016-03-02 09:59:10.902137 | 336740 | | 3ded7790 | 3 | d | e | d | 7 | 7 | 9 | 0
8576573 | 2016-03-02 09:59:12.962923 | 130223 | | 3eed5793 | 3 | e | e | d | 5 | 7 | 9 | 3
(9 rows)

Time: 268.661 ms

postgres=# explain (analyze,verbose,timing,buffers,costs) select * from t_split where info ~ '^3.?[b-g]+ed[\d]+79' order by info <-> '^3.?[b-g]+ed[\d]+79' limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4302.66..4302.69 rows=10 width=57) (actual time=269.214..269.217 rows=9 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '^3.?[b-g]+ed[\d]+79'::text))
Buffers: shared hit=52606
-> Sort (cost=4302.66..4305.16 rows=1000 width=57) (actual time=269.212..269.212 rows=9 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '^3.?[b-g]+ed[\d]+79'::text))
Sort Key: ((t_split.info <-> '^3.?[b-g]+ed[\d]+79'::text))
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=52606
-> Bitmap Heap Scan on public.t_split (cost=575.75..4281.06 rows=1000 width=57) (actual time=100.771..269.180 rows=9 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, (info <-> '^3.?[b-g]+ed[\d]+79'::text)
Recheck Cond: (t_split.info ~ '^3.?[b-g]+ed[\d]+79'::text)
Rows Removed by Index Recheck: 72929
Heap Blocks: exact=52479
Buffers: shared hit=52606
-> Bitmap Index Scan on idx9 (cost=0.00..575.50 rows=1000 width=0) (actual time=88.062..88.062 rows=72938 loops=1)
Index Cond: (t_split.info ~ '^3.?[b-g]+ed[\d]+79'::text)
Buffers: shared hit=127
Planning time: 0.640 ms
Execution time: 269.287 ms
(19 rows)

Time: 270.430 ms

postgres=# select * from t_split where info ~ '3.?[b-g]+ed[\d]+79' order by info <-> '3.?[b-g]+ed[\d]+79' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
---------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
1308724 | 2016-03-02 09:58:14.590901 | 458822 | | 3fed9479 | 3 | f | e | d | 9 | 4 | 7 | 9
4826729 | 2016-03-02 09:58:42.907431 | 228023 | | 3ded9879 | 3 | d | e | d | 9 | 8 | 7 | 9
5250603 | 2016-03-02 09:58:46.300289 | 250582 | | d3eed179 | d | 3 | e | e | d | 1 | 7 | 9
6113373 | 2016-03-02 09:58:53.211146 | 499702 | | 36fed479 | 3 | 6 | f | e | d | 4 | 7 | 9
1768237 | 2016-03-02 09:58:18.310069 | 345027 | | 30fed079 | 3 | 0 | f | e | d | 0 | 7 | 9
2866024 | 2016-03-02 09:58:27.20105 | 106467 | | 3fed2279 | 3 | f | e | d | 2 | 2 | 7 | 9
1472324 | 2016-03-02 09:58:15.913629 | 413283 | | 3eed5798 | 3 | e | e | d | 5 | 7 | 9 | 8
8576573 | 2016-03-02 09:59:12.962923 | 130223 | | 3eed5793 | 3 | e | e | d | 5 | 7 | 9 | 3
8319056 | 2016-03-02 09:59:10.902137 | 336740 | | 3ded7790 | 3 | d | e | d | 7 | 7 | 9 | 0
(10 rows)

Time: 320.525 ms

postgres=# explain (analyze,verbose,buffers,costs,timing) select * from t_split where info ~ '3.?[b-g]+ed[\d]+79' order by info <-> '3.?[b-g]+ed[\d]+79' limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4302.66..4302.69 rows=10 width=57) (actual time=319.925..319.927 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '3.?[b-g]+ed[\d]+79'::text))
Buffers: shared hit=52602
-> Sort (cost=4302.66..4305.16 rows=1000 width=57) (actual time=319.923..319.923 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '3.?[b-g]+ed[\d]+79'::text))
Sort Key: ((t_split.info <-> '3.?[b-g]+ed[\d]+79'::text))
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=52602
-> Bitmap Heap Scan on public.t_split (cost=575.75..4281.06 rows=1000 width=57) (actual time=104.526..319.885 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, (info <-> '3.?[b-g]+ed[\d]+79'::text)
Recheck Cond: (t_split.info ~ '3.?[b-g]+ed[\d]+79'::text)
Rows Removed by Index Recheck: 72928
Heap Blocks: exact=52479
Buffers: shared hit=52602
-> Bitmap Index Scan on idx9 (cost=0.00..575.50 rows=1000 width=0) (actual time=91.945..91.945 rows=72938 loops=1)
Index Cond: (t_split.info ~ '3.?[b-g]+ed[\d]+79'::text)
Buffers: shared hit=123
Planning time: 0.948 ms
Execution time: 320.003 ms
(19 rows)

Time: 321.502 ms

Performance Test For Large Data Volume

Simulate a partition table, with one partition per hour. The data volume is 50 million per hour, 1.2 billion per day, and 36 billion per month.

drop table tbl cascade;    
create table tbl (id serial8, crt_time timestamp, sensorid int, sensorloc point, info text);

do language plpgsql $$
declare
v_s timestamp := '2016-01-01 00:00:00';
begin
for i in 1..720 loop
execute 'create table tbl_'||to_char(v_s,'yyyymmddhh24')||' (id int8 not null default nextval(''tbl_id_seq''::regclass), crt_time timestamp check (crt_time >= '''||to_char(v_s,'yyyy-mm-dd hh24:mi:ss')||''' and crt_time <'''||to_char(v_s+'1 h'::interval,'yyyy-mm-dd hh24:mi:ss')||'''), sensorid int, sensorloc point, info text) inherits (tbl)';
v_s := v_s + '1 h'::interval;
end loop;
end;
$$;

alter sequence tbl_id_seq cache 100000;

Generate INSERT SQL.

do language plpgsql $$    
declare
v_s timestamp := '2016-01-01 00:00:00';
begin
for i in 1..720 loop
raise notice '%', 'psql -c "insert into tbl_'||to_char(v_s,'yyyymmddhh24')||' (crt_time, sensorid, info) select '''||to_char(v_s,'yyyy-mm-dd hh24:mi:ss')||''',trunc(random()*500000), substring(md5(random()::text),1,8) from generate_series(1,50000000);" &';
v_s := v_s + '1 h'::interval;
end loop;
end;
$$;

Performance metric: For 50 million data records in a single table, the range scan returns the results in milliseconds.
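Note that the plan below uses a trigram GIN index on the hourly partition (idx_tbl_2016010112). Its creation is not shown above, but it would presumably be built on each partition along these lines:

create index idx_tbl_2016010112 on tbl_2016010112 using gin (info gin_trgm_ops);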

postgres=# explain (analyze,verbose,timing,buffers,costs) select * from tbl where crt_time between '2016-01-01 12:00:00' and '2016-01-01 12:30:00' and info ~ 'f[\d]{2}e27e0$' order by info <-> 'f[\d]{2}e27e0$' limit 10;    
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18918.83..18918.85 rows=10 width=45) (actual time=350.296..350.297 rows=2 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info, ((tbl.info <-> 'f[\d]{2}e27e0$'::text))
Buffers: shared hit=4530
-> Sort (cost=18918.83..18931.33 rows=5001 width=45) (actual time=350.294..350.295 rows=2 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info, ((tbl.info <-> 'f[\d]{2}e27e0$'::text))
Sort Key: ((tbl.info <-> 'f[\d]{2}e27e0$'::text))
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=4530
-> Result (cost=0.00..18810.76 rows=5001 width=45) (actual time=347.995..350.279 rows=2 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info, (tbl.info <-> 'f[\d]{2}e27e0$'::text)
Buffers: shared hit=4530
-> Append (cost=0.00..18798.26 rows=5001 width=45) (actual time=347.976..350.254 rows=2 loops=1)
Buffers: shared hit=4530
-> Seq Scan on public.tbl (cost=0.00..0.00 rows=1 width=68) (actual time=0.001..0.001 rows=0 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info
Filter: ((tbl.crt_time >= '2016-01-01 12:00:00'::timestamp without time zone) AND (tbl.crt_time <= '2016-01-01 12:30:00'::timestamp without time zone) AND (tbl.info ~ 'f[\d]{2}e27e0$'::text))
-> Bitmap Heap Scan on public.tbl_2016010112 (cost=574.75..18798.26 rows=5000 width=45) (actual time=347.972..350.249 rows=2 loops=1)
Output: tbl_2016010112.id, tbl_2016010112.crt_time, tbl_2016010112.sensorid, tbl_2016010112.sensorloc, tbl_2016010112.info
Recheck Cond: (tbl_2016010112.info ~ 'f[\d]{2}e27e0$'::text)
Rows Removed by Index Recheck: 4100
Filter: ((tbl_2016010112.crt_time >= '2016-01-01 12:00:00'::timestamp without time zone) AND (tbl_2016010112.crt_time <= '2016-01-01 12:30:00'::timestamp without time zone))
Heap Blocks: exact=4085
Buffers: shared hit=4530
-> Bitmap Index Scan on idx_tbl_2016010112 (cost=0.00..573.50 rows=5000 width=0) (actual time=337.125..337.125 rows=4102 loops=1)
Index Cond: (tbl_2016010112.info ~ 'f[\d]{2}e27e0$'::text)
Buffers: shared hit=445
Planning time: 23.913 ms
Execution time: 350.383 ms
(28 rows)

postgres=# select * from tbl where crt_time between '2016-01-01 12:00:00' and '2016-01-01 12:30:00' and info ~ 'f[\d]{2}e27e0$' order by info <-> 'f[\d]{2}e27e0$' limit 10;
id | crt_time | sensorid | sensorloc | info
------------+---------------------+----------+-----------+----------
1982100172 | 2016-01-01 12:00:00 | 336772 | | f48e27e0
2292713691 | 2016-01-01 12:00:00 | 489110 | | f77e27e0
(2 rows)

Performance Test of Regex Search and Fuzzy Search in a Single Table with 14.4 Billion Data Records

postgres=# \dt+ t_all    
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------+-------+----------+--------+-------------
public | t_all | table | postgres | 811 GB |
(1 row)

postgres=# \d t_all
Table "public.t_all"
Column | Type | Modifiers
-----------+-----------------------------+-----------
id | bigint | not null
crt_time | timestamp without time zone |
sensorid | integer |
sensorloc | point |
info | text |

postgres=# select count(*) from t_all;
count
-------------
14456717312
(1 row)

postgres=# select * from t_all limit 10;
id | crt_time | sensorid | sensorloc | info
------------+---------------------+----------+-----------+----------
6519272065 | 2016-01-06 10:00:00 | 493013 | | 62255c83
6519272066 | 2016-01-06 10:00:00 | 309676 | | f6c98800
6519272067 | 2016-01-06 10:00:00 | 43859 | | 125a1191
6519272068 | 2016-01-06 10:00:00 | 495624 | | e75cfd71
6519272069 | 2016-01-06 10:00:00 | 10362 | | 7171f11c
6519272070 | 2016-01-06 10:00:00 | 231476 | | 4201f809
6519272071 | 2016-01-06 10:00:00 | 43080 | | a47e84e5
6519272072 | 2016-01-06 10:00:00 | 131292 | | 17bc248e
6519272073 | 2016-01-06 10:00:00 | 486841 | | 3303097c
6519272074 | 2016-01-06 10:00:00 | 491503 | | f0c53fee
(10 rows)

The test data will be released later. It shows that, after sharding the table, there is no problem returning results within seconds.

The efficiency is high because the bottleneck is not I/O but mainly the recheck of candidate rows. If the 14.4 billion records are split into 29 tables (each containing about 500 million records) and the queries are executed in parallel, the results can be obtained in seconds.

Let’s take a look at the test results for 500 million data records in a single table. The results are returned in seconds.

postgres=# explain (verbose,analyze,buffers,timing,costs) select * from tbl1 where info ~ 'aad.+f02' limit 10;  
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1439.79..1476.19 rows=10 width=29) (actual time=116.570..116.719 rows=10 loops=1)
Output: id, crt_time, sensorid, info
Buffers: shared hit=680
-> Bitmap Heap Scan on public.tbl1 (cost=1439.79..191054.88 rows=52103 width=29) (actual time=116.568..116.716 rows=10 loops=1)
Output: id, crt_time, sensorid, info
Recheck Cond: (tbl1.info ~ 'aad.+f02'::text)
Rows Removed by Index Recheck: 38
Heap Blocks: exact=48
Buffers: shared hit=680
-> Bitmap Index Scan on tbl1_info_idx (cost=0.00..1426.77 rows=52103 width=0) (actual time=116.495..116.495 rows=403 loops=1)
Index Cond: (tbl1.info ~ 'aad.+f02'::text)
Buffers: shared hit=632
Planning time: 0.311 ms
Execution time: 116.754 ms
(14 rows)

Time: 117.422 ms
postgres=# select * from tbl1 where info ~ 'aad.+f02' limit 10;
id | crt_time | sensorid | info
-----------+----------------------------+----------+----------
17986922 | 2016-02-29 17:42:42.427639 | 75863 | aad3f02a
19873247 | 2016-02-29 17:43:16.714945 | 174971 | 2aad5f02
23798336 | 2016-02-29 17:44:35.369654 | 202085 | aad06f02
28630866 | 2016-02-29 17:46:03.544462 | 463184 | baad3f02
31458529 | 2016-02-29 17:47:00.300937 | 411670 | aad1af02
52009687 | 2016-02-29 17:53:15.466246 | 192821 | 5aad6f02
80769909 | 2016-02-29 18:01:31.074248 | 47993 | aadcf029
80825896 | 2016-02-29 18:01:31.039063 | 284712 | aad14f02
83385996 | 2016-02-29 18:02:12.699317 | 78233 | daadcf02
102814316 | 2016-02-29 18:08:20.891412 | 359635 | aad06f02
(10 rows)

Time: 116.901 ms

The following shows the time required for the full-table scan.

postgres=# set enable_bitmapscan=off;  
SET
Time: 0.145 ms
postgres=# select * from tbl1 where info ~ 'aad.+f02' limit 10;
id | crt_time | sensorid | info
-----------+----------------------------+----------+----------
31458529 | 2016-02-29 17:47:00.300937 | 411670 | aad1af02
52009687 | 2016-02-29 17:53:15.466246 | 192821 | 5aad6f02
80769909 | 2016-02-29 18:01:31.074248 | 47993 | aadcf029
80825896 | 2016-02-29 18:01:31.039063 | 284712 | aad14f02
83385996 | 2016-02-29 18:02:12.699317 | 78233 | daadcf02
102814316 | 2016-02-29 18:08:20.891412 | 359635 | aad06f02
105236847 | 2016-02-29 18:09:56.914795 | 876 | aadbf026
108524272 | 2016-02-29 18:10:47.39312 | 338071 | 2aad2f02
128169786 | 2016-02-29 18:17:52.105948 | 262400 | aad0f028
135935810 | 2016-02-29 18:20:43.265139 | 487673 | aad7f021
(10 rows)
Time: 98903.073 ms

Performance Comparison Chart

The following figure shows the comparison in the case of 10 million data records.

The following figure shows the comparison in the case of 500 million data records.

The following figure shows the comparison between B-tree “bit or|and” and GIN in the case of 10 million pieces of data.

The following figure shows the comparison in the case of 14.4 billion data records of the partition table.

Optimization methods for large data volumes: for example, how to achieve second-level fuzzy query responses on more than 10 billion records.

On a single machine, you can use partitions and parallel searches to make full use of the CPU resources.

Alternatively, you can use an MPP or sharding architecture to utilize the resources of multiple machines, as sketched below.
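For illustration only, here is a minimal sketch of fanning the same fuzzy query out to several shards in parallel with the dblink extension; the connection names, connection strings, and shard layout are hypothetical, and error handling is omitted:

create extension if not exists dblink;

-- open one connection per shard
select dblink_connect('shard1', 'host=shard1 dbname=postgres');
select dblink_connect('shard2', 'host=shard2 dbname=postgres');

-- send the same query to every shard asynchronously
select dblink_send_query('shard1', $$ select * from tbl where info ~ '5821a' limit 10 $$);
select dblink_send_query('shard2', $$ select * from tbl where info ~ '5821a' limit 10 $$);

-- collect and combine the results
select * from dblink_get_result('shard1')
  as t(id int8, crt_time timestamp, sensorid int, sensorloc point, info text)
union all
select * from dblink_get_result('shard2')
  as t(id int8, crt_time timestamp, sensorid int, sensorloc point, info text);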

Principle: reduce the number of rechecks, so that the scan returns as close to the final desired rows as possible (otherwise, both full-table scans and index scans read a large number of rows that are later removed as false matches).

Additional Information

1. A brief description of the test host configuration

A common x86 server: 24 cores, 1xx GB of memory, and SSD.

2. Reference question from a friend

Question to Digoal: Why do you use only the first 8 characters of the generated MD5 data? An MD5 digest has at least 16 characters. What is the mathematical reason behind this?

Reply: The result of the MD5 computation is a 128-bit value, and PostgreSQL outputs it as hexadecimal text, so what we see is a 32-character string.

RFC 1321

3.1 Step 1. Append Padding Bits
3.2 Step 2. Append Length
3.3 Step 3. Initialize MD Buffer
3.4 Step 4. Process Message in 16-Word Blocks
3.5 Step 5. Output

4. Summary
The MD5 message-digest algorithm is simple to implement, and provides
a "fingerprint" or message digest of a message of arbitrary length.
It is conjectured that the difficulty of coming up with two messages
having the same message digest is on the order of 2^64 operations,
and that the difficulty of coming up with any message having a given
message digest is on the order of 2^128 operations. The MD5 algorithm
has been carefully scrutinized for weaknesses. It is, however, a
relatively new algorithm and further security analysis is of course
justified, as is the case with any new proposal of this sort.

PG code.

/*  
* Create an md5 hash of a text string and return it as hex
*
* md5 produces a 16 byte (128 bit) hash; double it for hex
*/
#define MD5_HASH_LEN 32

Datum
md5_text(PG_FUNCTION_ARGS)
{
text *in_text = PG_GETARG_TEXT_PP(0);
size_t len;
char hexsum[MD5_HASH_LEN + 1];

/* Calculate the length of the buffer using varlena metadata */
len = VARSIZE_ANY_EXHDR(in_text);

/* get the hash result */
if (pg_md5_hash(VARDATA_ANY(in_text), len, hexsum) == false)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));

/* convert to text and return it */
PG_RETURN_TEXT_P(cstring_to_text(hexsum));
}

/*
* pg_md5_hash
*
* Calculates the MD5 sum of the bytes in a buffer.
*
* SYNOPSIS #include "md5.h"
* int pg_md5_hash(const void *buff, size_t len, char *hexsum)
*
* INPUT buff the buffer containing the bytes that you want
* the MD5 sum of.
* len number of bytes in the buffer.
*
* OUTPUT hexsum the MD5 sum as a '\0'-terminated string of
* hexadecimal digits. an MD5 sum is 16 bytes long.
* each byte is represented by two heaxadecimal
* characters. you thus need to provide an array
* of 33 characters, including the trailing '\0'.
*
* RETURNS false on failure (out of memory for internal buffers) or
* true on success.
*
* STANDARDS MD5 is described in RFC 1321.
*
* AUTHOR Sverre H. Huseby <sverrehu@online.no>
*
*/
bool
pg_md5_hash(const void *buff, size_t len, char *hexsum)
{
uint8 sum[16];

if (!calculateDigestFromBuffer(buff, len, sum))
return false;

bytesToHex(sum, hexsum);
return true;
}

static int
calculateDigestFromBuffer(const uint8 *b, uint32 len, uint8 sum[16])
{
register uint32 i,
j,
k,
newI;
uint32 l;
uint8 *input;
register uint32 *wbp;
uint32 workBuff[16],
state[4];

l = len;

state[0] = 0x67452301;
state[1] = 0xEFCDAB89;
state[2] = 0x98BADCFE;
state[3] = 0x10325476;

if ((input = createPaddedCopyWithLength(b, &l)) == NULL)
return 0;

for (i = 0;;)
{
if ((newI = i + 16 * 4) > l)
break;
k = i + 3;
for (j = 0; j < 16; j++)
{
wbp = (workBuff + j);
*wbp = input[k--];
*wbp <<= 8;
*wbp |= input[k--];
*wbp <<= 8;
*wbp |= input[k--];
*wbp <<= 8;
*wbp |= input[k];
k += 7;
}
doTheRounds(workBuff, state);
i = newI;
}
free(input);

j = 0;
for (i = 0; i < 4; i++)
{
k = state[i];
sum[j++] = (k & 0xff);
k >>= 8;
sum[j++] = (k & 0xff);
k >>= 8;
sum[j++] = (k & 0xff);
k >>= 8;
sum[j++] = (k & 0xff);
}
return 1;
}


postgres=# select length(md5(random()::text)),md5(random()::text) from generate_series(1,10);
length | md5
--------+----------------------------------
32 | 4b0fcafd0f9d189cceade2812e4aa396
32 | 9169f49f0c059a65de3325a20068eb8e
32 | 7df4b60638972f1372fde91e3d2eee50
32 | 4fc6816aa88224163dda3e242f3e16f0
32 | 518870a2c8a9cfe6cba018916f5388a9
32 | b46b63c409532e5c973ddc27a5e49ce4
32 | e24e9d63926094046aa2b300dfa8e986
32 | 3f1ca7c5ce5159a0df17729f81e24925
32 | 5756f0925eca4c3801c4a49cf9b68023
32 | ed54795fcbe9491f4e5a00ec2cf323ee
(10 rows)

In addition, the purpose of taking only the first 8 characters is to generate an 8-character random string, a compromise that approximates the length of data such as license plates, QQ numbers, and bar codes.

In fact, you can also test 16-character or 32-character random strings. Test data for those lengths will be added later.

Test for 32-Character Strings

Below are the performance figures for regex searches and fuzzy searches over 1 billion 32-character strings on a single machine.

Insert 1 billion data records, storing a 32-character random string in each record.

postgres=# create table t_regexp (info text);    
CREATE TABLE
postgres=# insert into t_regexp select md5(random()::text) from generate_series(1,1000000000);
INSERT 0 1000000000
postgres=# create index idx_t_regexp_gin on t_regexp using gin (info gin_trgm_ops);
CREATE INDEX
postgres=# create index idx_t_regexp_1 on t_regexp (info);
CREATE INDEX
postgres=# create index idx_t_regexp_2 on t_regexp (reverse(info));
CREATE INDEX

Use auto_explain to collect performance test data.

load 'auto_explain';      
set auto_explain.log_analyze =true;
set auto_explain.log_buffers =true;
set auto_explain.log_nested_statements=true;
set auto_explain.log_timing=true;
set auto_explain.log_triggers=true;
set auto_explain.log_verbose=true;
set auto_explain.log_min_duration=0;
set client_min_messages ='log';
set work_mem='2GB';

Test Data Overview: The total number of records is 1 billion, of which about 800 million are unique.

digoal=> select count(*) from t_regexp ;    
count
------------
1000000000
(1 row)
Time: 18547.746 ms

digoal=> select count(*) from (select info from t_regexp group by info) t;
count
-----------
799444838
(1 row)

Find below the prefix fuzzy search performance with results.

digoal=# select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7e' limit 10;    
LOG: duration: 0.351 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7e' limit 10;
Limit (cost=0.15..8.17 rows=1 width=39) (actual time=0.152..0.346 rows=2 loops=1)
Output: ctid, info
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=0.15..8.17 rows=1 width=39) (actual time=0.150..0.342 rows=2 loops=1)
Output: ctid, info
ctid | info
--------------+----------------------------------
(14720,1) | 9def5fe6343a5938b23af38444b7e350
(379127,105) | 9def5fe6343a5938b23af38444b7e350
(2 rows)
Time: 7.952 ms

Find below the prefix fuzzy search performance without results.

digoal=# select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7a' limit 10;    
LOG: duration: 0.447 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7a' limit 10;
Limit (cost=0.15..8.17 rows=1 width=39) (actual time=0.443..0.443 rows=0 loops=1)
Output: ctid, info
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=0.15..8.17 rows=1 width=39) (actual time=0.441..0.441 rows=0 loops=1)
Output: ctid, info
ctid | info
------+------
(0 rows)
Time: 7.968 ms

Find below the suffix fuzzy search performance with results.

digoal=# select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495';    
LOG: duration: 5.287 ms plan:
Query Text: select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495';
Remote Fast Query Execution (cost=0.00..0.00 rows=0 width=0) (actual time=4.943..5.281 rows=1 loops=1)
Output: t_regexp.ctid, t_regexp.info
Node/s: data1, data10, data11, data12, data13, data14, data15, data16, data2, data3, data4, data5, data6, data7, data8, data9
Remote query: SELECT ctid, info FROM t_regexp WHERE (reverse(info) ~ '^1e346e5efc7703f11495'::text)
ctid | info
------------+----------------------------------
(19928,46) | cc302768b6a459411f3077cfe5e643e1
(1 row)
Time: 6.079 ms

Find below the suffix fuzzy search performance without results.

digoal=# select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495123';    
LOG: duration: 4.157 ms plan:
Query Text: select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495123';
Remote Fast Query Execution (cost=0.00..0.00 rows=0 width=0) (actual time=4.154..4.154 rows=0 loops=1)
Output: t_regexp.ctid, t_regexp.info
Node/s: data1, data10, data11, data12, data13, data14, data15, data16, data2, data3, data4, data5, data6, data7, data8, data9
Remote query: SELECT ctid, info FROM t_regexp WHERE (reverse(info) ~ '^1e346e5efc7703f11495123'::text)
ctid | info
------+------
(0 rows)
Time: 4.930 ms

Find below the fully fuzzy search performance with results.

digoal=# select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c949' order by info <-> '6ccd8ca827b0526cd57a71c949' limit 10;    
LOG: duration: 3066.193 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c949' order by info <-> '6ccd8ca827b0526cd57a71c949' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3066.175..3066.176 rows=1 loops=1)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c949'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3066.173..3066.174 rows=1 loops=1)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c949'::text)
ctid | info
--------------+----------------------------------
(459019,106) | 8536ccd8ca827b0526cd57a71c949ab2
(1 row)
Time: 3074.907 ms

Find below the fully fuzzy search performance without results.

digoal=# select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;    
LOG: duration: 3099.348 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3099.341..3099.341 rows=0 loops=1)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3099.339..3099.339 rows=0 loops=1)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c123'::text)
ctid | info
------+------
(0 rows)
Time: 3108.698 ms

The GIN index is used.

digoal=# explain (verbose) select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;    
LOG: duration: 0.000 ms plan:
Query Text: explain (verbose) select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c123'::text)
-> Limit (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Sort (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
Sort Key: ((t_regexp.info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Bitmap Heap Scan on public.t_regexp (cost=68.00..72.02 rows=1 width=39)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c123'::text)
Recheck Cond: (t_regexp.info ~ '6ccd8ca827b0526cd57a71c123'::text)
-> Bitmap Index Scan on idx_t_regexp_gin (cost=0.00..68.00 rows=1 width=0)
Index Cond: (t_regexp.info ~ '6ccd8ca827b0526cd57a71c123'::text)

Find below the regex match performance with results (for example, 0 is confused with 8 and B, and 6 is confused with b).

digoal=# select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' limit 10;    
LOG: duration: 3181.592 ms plan:
Query Text: select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3181.568..3181.569 rows=2 loops=1)
Output: ctid, info, ((info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3181.566..3181.566 rows=2 loops=1)
Output: ctid, info, (info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb'::text)
ctid | info
--------------+----------------------------------
(260416,110) | b44c9616bfa86a722daa9596f86ebefb
(504007,4) | b44c9616bfa86a722daa9596f86ebefb
(2 rows)
Time: 3189.666 ms

Consider another example: suppose you forget a few characters in the middle and only know that they are 2 to 5 digits.

digoal=# select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa[\d]{2,5}6f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb' limit 10;    
LOG: duration: 3249.156 ms plan:
Query Text: select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa[\d]{2,5}6f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3249.136..3249.137 rows=2 loops=1)
Output: ctid, info, ((info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3249.134..3249.134 rows=2 loops=1)
Output: ctid, info, (info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb'::text)
ctid | info
--------------+----------------------------------
(260416,110) | b44c9616bfa86a722daa9596f86ebefb
(504007,4) | b44c9616bfa86a722daa9596f86ebefb
(2 rows)
Time: 3257.435 ms

Find below the performance without results.

digoal=# select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' limit 10;    
LOG: duration: 3086.572 ms plan:
Query Text: select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3086.567..3086.567 rows=0 loops=1)
Output: ctid, info, ((info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3086.565..3086.565 rows=0 loops=1)
Output: ctid, info, (info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12'::text)
ctid | info
------+------
(0 rows)
Time: 3094.683 ms

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source
