PostgreSQL Responds to Fuzzy or Regular Expression-based Searches within Seconds

By Digoal

Background

Usually, Regex match and fuzzy match are the specialties of search engines. However, if you use a PostgreSQL database, it is still possible to implement these functions with good performance. In addition, coupled with distributed solutions (such as plproxy, pg_shard, fdw shard, pg-xc, pg-xl, greenplum), PostgreSQL database helps to effectively perform regular matches and fuzzy matches for more than ten billion data records without losing the inherent functions of the database.

The Internet of Things (IoT) generates a large amount of data, which includes not only digital data, but also string-type data, such as bar codes, license plates, mobile phone numbers, emails, and names. So, what is an efficient method to perform fuzzy searches or even regex matches on a large amount of sensor data?

There are several such scenarios, for example, a batch of drugs in the market may have problems. Then a regex search needs to be performed for drug barcodes to find out the flow direction of the eligible drugs.

Another example is to perform a search for clues during an investigation. The clues may include incomplete phone numbers, email addresses, license plates, IP addresses, QQ numbers, or WeChat numbers provided by users. According to available information along with the superposition of time, a fuzzy match and correlation can help to track the culprit.

Fuzzy Search Scenarios and Optimization Methods

To begin with, let’s take a look at the application scenarios classification and the optimization methods available in existing technologies.

1. Prefix Fuzzy Search: For example, LIKE ‘ABC%’, which can also be written as ~ ‘^ABC’ in PG.

2. Suffix Fuzzy Search: For example, LIKE ‘%ABC’, which can also be written as ~ ‘ABC$’ in PG.

3. Fully Fuzzy Search: For example, LIKE ‘%AB_C%’, which can also be written as ~ ‘AB.C’ in PG.

4. Regex Search: For example, ~ ‘[d]+def1.?[ a|b|0|8]{1,3}’

For all the four scenarios, you can use the GIN index for pg_trgm, or the multi-column index used for column splitting plus “bit and”/”bit or” (this is only suitable for fixed-length end strings, such as char(8)) for optimization.

The pg_trgm plug-in begins to support the use of indexes in fuzzy searches since PostgreSQL 9.1, and the use of indexes in regex searches since PostgreSQL 9.3. It greatly improves the ability of PostgreSQL in a criminal investigation.

For the code, you can refer to this document

The principle of the pg_trgm plug-in is to add 2 spaces before the string and 1 space after it to form a new string, and to split the new string into multiple token according to every 3 adjacent characters.

When a regular expression or fuzzy search is used for matching, the similarity of the two character strings will be retrieved before the filtering is done.

Example of the GIN index

When a matched token is retrieved from the B-tree, it points to the corresponding list and finds the corresponding record from the CTIDs stored in the list. Then, a string is split into many tokens, so that each time a record is inserted, multiple indexes are updated. This necessitates the GIN Fast Update.

So, if you are wondering how does the regex match work, refer to this document for more details.

In fact, it converts the regular expression to the NFA format and then scans multiple tokens for “bit and|or” matching. However, a large number of rechecks are required if many “bit and|or” results are obtained after conversion, which negatively impacts performance.

Let’s find out how to optimize the various fuzzy search scenarios with the help of examples in the following sections.

How to Optimize Different Fuzzy Search Scenarios: Examples

1. Prefix Fuzzy Search

Consider an example where, LIKE ‘ABC%’, can also be written as ~ ‘^ABC’ in PG. To optimize this you can use the B-tree index, or the multi-column index for column splitting plus “bit and”/”bit or” (this is only suitable for fixed-length end strings, such as char(8)). For example, following is the randomly generated first 8 characters of 10 million pieces of MD5 data.

postgres=# create table tb(info text);    
CREATE TABLE
postgres=# insert into tb select substring(md5(random()::text),1,8) from generate_series(1,10000000);
INSERT 0 10000000
postgres=# create index idx_tb on tb(info);
CREATE INDEX
postgres=# select * from tb limit 1;
info
----------
376821ab
(1 row)
postgres=# explain select * from tb where info ~ '^376821' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------
Limit (cost=0.43..0.52 rows=10 width=9)
-> Index Only Scan using idx_tb on tb (cost=0.43..8.46 rows=1000 width=9)
Index Cond: ((info >= '376821'::text) AND (info < '376822'::text))
Filter: (info ~ '^376821'::text)
(4 rows)
postgres=# select * from tb where info ~ '^376821' limit 10;
info
----------
376821ab
(1 row)
Time: 0.536 ms
postgres=# set enable_indexscan=off;
SET
Time: 1.344 ms
postgres=# set enable_bitmapscan=off;
SET
Time: 0.158 ms
postgres=# explain select * from tb where info ~ '^376821' limit 10;
QUERY PLAN
----------------------------------------------------------------
Limit (cost=0.00..1790.55 rows=10 width=9)
-> Seq Scan on tb (cost=0.00..179055.00 rows=1000 width=9)
Filter: (info ~ '^376821'::text)
(3 rows)
Time: 0.505 ms

For prefix fuzzy searches, it takes 5483 milliseconds if no index is used and 0.5 milliseconds if an index is used.

postgres=# select * from tb where info ~ '^376821' limit 10;    
info
----------
376821ab
(1 row)
Time: 5483.655 ms

2. Suffix Fuzzy Search

Consider an example where LIKE ‘%ABC’, can also be written as ~ ‘ABC$’ in PG. To optimize this scenario, you can use the B-tree index for reverse function, or the multi-column index for column splitting plus “bit and”/”bit or” (this is only suitable for fixed-length end strings, such as char(8)).

postgres=# create index idx_tb1 on tb(reverse(info));    
CREATE INDEX
postgres=# explain select * from tb where reverse(info) ~ '^ba128' limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------
Limit (cost=0.43..28.19 rows=10 width=9)
-> Index Scan using idx_tb1 on tb (cost=0.43..138778.43 rows=50000 width=9)
Index Cond: ((reverse(info) >= 'ba128'::text) AND (reverse(info) < 'ba129'::text))
Filter: (reverse(info) ~ '^ba128'::text)
(4 rows)

postgres=# select * from tb where reverse(info) ~ '^ba128' limit 10;
info
----------
220821ab
671821ab
305821ab
e65821ab
536821ab
376821ab
668821ab
4d8821ab
26c821ab
(9 rows)
Time: 0.506 ms

For fuzzy suffix searches, it only takes 0.5 milliseconds when an index is used.

3. Fully Fuzzy Search

Consider an example where LIKE ‘%AB_C%’, can also be written as ~ ‘AB.C’ in PG. To optimize this scenario, you can use the GIN index for pg_trgm, or the multi-column index for column splitting plus “bit and”/”bit or” (this is only suitable for fixed-length end strings, such as char(8)).

postgres=# create extension pg_trgm;    
postgres=# explain select * from tb where info ~ '5821a';
QUERY PLAN
----------------------------------------------------------------------------
Bitmap Heap Scan on tb (cost=103.75..3677.71 rows=1000 width=9)
Recheck Cond: (info ~ '5821a'::text)
-> Bitmap Index Scan on idx_tb_2 (cost=0.00..103.50 rows=1000 width=0)
Index Cond: (info ~ '5821a'::text)
(4 rows)
Time: 0.647 ms

postgres=# select * from tb where info ~ '5821a';
info
----------
5821a8a3
945821af
45821a74
9fe5821a
5821a7e0
5821af2a
1075821a
e5821ac9
d265821a
45f5821a
df5821a4
de5821af
71c5821a
375821a3
fc5821af
5c5821ad
e65821ab
5821adde
c35821a6
5821a642
305821ab
5821a1c8
75821a5c
ce95821a
a65821ad
(25 rows)
Time: 3.808 ms

For fully fuzzy searches, it only takes 3.8 milliseconds when an index is used.

4. Regex Search

Consider an example, ~ ‘[d]+def1.?[ a|b|0|8]{1,3}’. To optimize search performance in this case, you can use the GIN index for pg_trgm, or the multi-column index for column splitting plus “bit and”/”bit or” (this is only suitable for fixed-length end strings, such as char(8)).

For fully fuzzy searches, it only takes 108 milliseconds if an index is used.

postgres=# select * from tb where info ~ 'e65[\d]{2}a[b]{1,2}8' limit 10;    
info
----------
4e6567ab
1e6530ab
e6500ab8
ae6583ab
e6564ab7
5e6532ab
e6526abf
e6560ab6
(8 rows)
Time: 108.577 ms

It is important to note that time is mainly spent on exclusion operations.

14794 rows are retrieved, and 14793 rows are removed. A lot of time is spent on useless work, but it is still far better than the full-table scan.

postgres=# explain (verbose,analyze,buffers,costs,timing) select * from tb where info ~ 'e65[\d]{2}a[b]{1,2}8' limit 10;    
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=511.75..547.49 rows=10 width=9) (actual time=89.934..120.567 rows=1 loops=1)
Output: info
Buffers: shared hit=13054
-> Bitmap Heap Scan on public.tb (cost=511.75..4085.71 rows=1000 width=9) (actual time=89.930..120.562 rows=1 loops=1)
Output: info
Recheck Cond: (tb.info ~ 'e65[\d]{2}a[b]{1,2}8'::text)
Rows Removed by Index Recheck: 14793
Heap Blocks: exact=12929
Buffers: shared hit=13054
-> Bitmap Index Scan on idx_tb_2 (cost=0.00..511.50 rows=1000 width=0) (actual time=67.589..67.589 rows=14794 loops=1)
Index Cond: (tb.info ~ 'e65[\d]{2}a[b]{1,2}8'::text)
Buffers: shared hit=125
Planning time: 0.493 ms
Execution time: 120.618 ms
(14 rows)
Time: 124.693 ms

Performance Optimization

You need to consider performance issues when the GIN index is used because the info field is split into multiple char(3) tokens and a lot of index entries are involved. If the concurrency of insertion is high, it is best to set gin_pending_list_limit to a large value to improve insertion efficiency and reduce the RT increase caused by merging indexes in real-time.

After Fast Update is used, the pending information is automatically merged into the GIN index each time VACUUM operation is performed on the table. In addition, no merging action exists in the query. The GIN information that is not merged is searched through traversal.

Performance of High Concurrency Stress Testing

create table tbl(id serial8, crt_time timestamp, sensorid int, sensorloc point, info text) with (autovacuum_enabled=on, autovacuum_vacuum_threshold=0.000001,autovacuum_vacuum_cost_delay=0);    
CREATE INDEX trgm_idx ON tbl USING GIN (info gin_trgm_ops) with (fastupdate='on', gin_pending_list_limit='6553600');
alter sequence tbl_id_seq cache 10000;

Modify the configuration, to ensure that the database autovacuum merges GIN quickly and iteratively.

vi $PGDATA/postgresql.conf    
autovacuum_naptime=30s
maintenance_work_mem=1GB
autovacuum_work_mem=1GB
autovacuum = on
autovacuum_max_workers = 3
log_autovacuum_min_duration = 0
autovacuum_vacuum_cost_delay=0

$ pg_ctl reload

Create a test function to generate random test data.

postgres=# create or replace function f() returns void as $$    
insert into tbl (crt_time,sensorid,info) values ( clock_timestamp(),trunc(random()*500000),substring(md5(random()::text),1,8) );
$$ language sql strict;
vi test.sql
select f();

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 10000

progress: 50.0 s, 52800.9 tps, lat 0.453 ms stddev 0.390
progress: 51.0 s, 52775.8 tps, lat 0.453 ms stddev 0.398
progress: 52.0 s, 53173.2 tps, lat 0.449 ms stddev 0.371
progress: 53.0 s, 53010.0 tps, lat 0.451 ms stddev 0.390
progress: 54.0 s, 53360.9 tps, lat 0.448 ms stddev 0.365
progress: 55.0 s, 53285.0 tps, lat 0.449 ms stddev 0.362
progress: 56.0 s, 53662.1 tps, lat 0.445 ms stddev 0.368
progress: 57.0 s, 53283.8 tps, lat 0.448 ms stddev 0.385
progress: 58.0 s, 53703.4 tps, lat 0.445 ms stddev 0.355
progress: 59.0 s, 53818.7 tps, lat 0.444 ms stddev 0.344
progress: 60.0 s, 53889.2 tps, lat 0.443 ms stddev 0.361
progress: 61.0 s, 53613.8 tps, lat 0.446 ms stddev 0.355
progress: 62.0 s, 53339.9 tps, lat 0.448 ms stddev 0.392
progress: 63.0 s, 54014.9 tps, lat 0.442 ms stddev 0.346
progress: 64.0 s, 53112.1 tps, lat 0.450 ms stddev 0.374
progress: 65.0 s, 53706.1 tps, lat 0.445 ms stddev 0.367
progress: 66.0 s, 53720.9 tps, lat 0.445 ms stddev 0.353
progress: 67.0 s, 52858.1 tps, lat 0.452 ms stddev 0.415
progress: 68.0 s, 53218.9 tps, lat 0.449 ms stddev 0.387
progress: 69.0 s, 53403.0 tps, lat 0.447 ms stddev 0.377
progress: 70.0 s, 53179.9 tps, lat 0.449 ms stddev 0.377
progress: 71.0 s, 53232.4 tps, lat 0.449 ms stddev 0.373
progress: 72.0 s, 53011.7 tps, lat 0.451 ms stddev 0.386
progress: 73.0 s, 52685.1 tps, lat 0.454 ms stddev 0.384
progress: 74.0 s, 52937.8 tps, lat 0.452 ms stddev 0.377

At this speed, you can store more than 4 billion data records in the database each day.

Next, let’s see the example of splitting strings. This example applies to scenarios where the string length is fixed and small. If the string length is not fixed, this method is ineffective.

Although the split method is applicable, the test data is unsatisfactory. Therefore, it is more reliable to use pg_trgm instead.

postgres=# create table t_split(id int, crt_time timestamp, sensorid int, sensorloc point, info text, c1 char(1), c2 char(1), c3 char(1), c4 char(1), c5 char(1), c6 char(1), c7 char(1), c8 char(1));    
CREATE TABLE
Time: 2.123 ms

postgres=# insert into t_split(id,crt_time,sensorid,info,c1,c2,c3,c4,c5,c6,c7,c8) select id,ct,sen,info,substring(info,1,1),substring(info,2,1),substring(info,3,1),substring(info,4,1),substring(info,5,1),substring(info,6,1),substring(info,7,1),substring(info,8,1) from (select id, clock_timestamp() ct, trunc(random()*500000) sen, substring(md5(random()::text), 1, 8) info from generate_series(1,10000000) t(id)) t;
INSERT 0 10000000
Time: 81829.274 ms

postgres=# create index idx1 on t_split (c1);
postgres=# create index idx2 on t_split (c2);
postgres=# create index idx3 on t_split (c3);
postgres=# create index idx4 on t_split (c4);
postgres=# create index idx5 on t_split (c5);
postgres=# create index idx6 on t_split (c6);
postgres=# create index idx7 on t_split (c7);
postgres=# create index idx8 on t_split (c8);
postgres=# create index idx9 on t_split using gin (info gin_trgm_ops);

postgres=# select * from t_split limit 1;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
(1 row)

postgres=# select * from t_split where info ~ '^3[\d]?eed[\d]?79$' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
(1 row)
Time: 133.041 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^3[\d]?eed[\d]?79$' limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=575.75..612.78 rows=10 width=57) (actual time=92.406..129.838 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=13798
-> Bitmap Heap Scan on public.t_split (cost=575.75..4278.56 rows=1000 width=57) (actual time=92.403..129.833 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Recheck Cond: (t_split.info ~ '^3[\d]?eed[\d]?79$'::text)
Rows Removed by Index Recheck: 14690
Heap Blocks: exact=13669
Buffers: shared hit=13798
-> Bitmap Index Scan on idx9 (cost=0.00..575.50 rows=1000 width=0) (actual time=89.576..89.576 rows=14691 loops=1)
Index Cond: (t_split.info ~ '^3[\d]?eed[\d]?79$'::text)
Buffers: shared hit=129
Planning time: 0.385 ms
Execution time: 129.883 ms
(14 rows)

Time: 130.678 ms


postgres=# select * from t_split where c1='3' and c3='e' and c4='e' and c5='d' and c7='7' and c8='9' and c2 between '0' and '9' and c6 between '0' and '9' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
(1 row)

Time: 337.367 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where c1='3' and c3='e' and c4='e' and c5='d' and c7='7' and c8='9' and c2 between '0' and '9' and c6 between '0' and '9' limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=33582.31..41499.35 rows=1 width=57) (actual time=339.230..344.675 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=7581
-> Bitmap Heap Scan on public.t_split (cost=33582.31..41499.35 rows=1 width=57) (actual time=339.228..344.673 rows=1 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Recheck Cond: ((t_split.c3 = 'e'::bpchar) AND (t_split.c8 = '9'::bpchar) AND (t_split.c5 = 'd'::bpchar))
Filter: ((t_split.c2 >= '0'::bpchar) AND (t_split.c2 <= '9'::bpchar) AND (t_split.c6 >= '0'::bpchar) AND (t_split.c6 <= '9'::bpchar) AND (t_split.c1 = '3'::bpchar) AND (t_split.c4 = 'e'::bpchar) AND (t_split.c7 = '7'::bpchar))
Rows Removed by Filter: 2480
Heap Blocks: exact=2450
Buffers: shared hit=7581
-> BitmapAnd (cost=33582.31..33582.31 rows=2224 width=0) (actual time=338.512..338.512 rows=0 loops=1)
Buffers: shared hit=5131
-> Bitmap Index Scan on idx3 (cost=0.00..11016.93 rows=596333 width=0) (actual time=104.418..104.418 rows=624930 loops=1)
Index Cond: (t_split.c3 = 'e'::bpchar)
Buffers: shared hit=1711
-> Bitmap Index Scan on idx8 (cost=0.00..11245.44 rows=608667 width=0) (actual time=100.185..100.185 rows=625739 loops=1)
Index Cond: (t_split.c8 = '9'::bpchar)
Buffers: shared hit=1712
-> Bitmap Index Scan on idx5 (cost=0.00..11319.44 rows=612667 width=0) (actual time=99.480..99.480 rows=624269 loops=1)
Index Cond: (t_split.c5 = 'd'::bpchar)
Buffers: shared hit=1708
Planning time: 0.262 ms
Execution time: 344.731 ms
(23 rows)

Time: 346.424 ms

postgres=# select * from t_split where info ~ '^33.+7.+9$' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
--------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
24412 | 2016-03-02 09:58:04.186359 | 251599 | | 33f07429 | 3 | 3 | f | 0 | 7 | 4 | 2 | 9
24989 | 2016-03-02 09:58:04.191112 | 214569 | | 334587d9 | 3 | 3 | 4 | 5 | 8 | 7 | d | 9
50100 | 2016-03-02 09:58:04.398499 | 409819 | | 33beb7b9 | 3 | 3 | b | e | b | 7 | b | 9
92623 | 2016-03-02 09:58:04.745372 | 280100 | | 3373e719 | 3 | 3 | 7 | 3 | e | 7 | 1 | 9
106054 | 2016-03-02 09:58:04.855627 | 155192 | | 33c575c9 | 3 | 3 | c | 5 | 7 | 5 | c | 9
107070 | 2016-03-02 09:58:04.863827 | 464325 | | 337dd729 | 3 | 3 | 7 | d | d | 7 | 2 | 9
135152 | 2016-03-02 09:58:05.088217 | 240500 | | 336271d9 | 3 | 3 | 6 | 2 | 7 | 1 | d | 9
156425 | 2016-03-02 09:58:05.25805 | 218202 | | 333e7289 | 3 | 3 | 3 | e | 7 | 2 | 8 | 9
170210 | 2016-03-02 09:58:05.368371 | 132530 | | 33a8d789 | 3 | 3 | a | 8 | d | 7 | 8 | 9
(10 rows)

Time: 20.431 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^33.+7.+9$' limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=43.75..80.78 rows=10 width=57) (actual time=19.573..21.212 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=566
-> Bitmap Heap Scan on public.t_split (cost=43.75..3746.56 rows=1000 width=57) (actual time=19.571..21.206 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Recheck Cond: (t_split.info ~ '^33.+7.+9$'::text)
Rows Removed by Index Recheck: 647
Heap Blocks: exact=552
Buffers: shared hit=566
-> Bitmap Index Scan on idx9 (cost=0.00..43.50 rows=1000 width=0) (actual time=11.712..11.712 rows=39436 loops=1)
Index Cond: (t_split.info ~ '^33.+7.+9$'::text)
Buffers: shared hit=14
Planning time: 0.301 ms
Execution time: 21.255 ms
(14 rows)

Time: 21.995 ms


postgres=# select * from t_split where c1='3' and c2='3' and c8='9' and (c4='7' or c5='7' or c6='7') limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
--------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
24412 | 2016-03-02 09:58:04.186359 | 251599 | | 33f07429 | 3 | 3 | f | 0 | 7 | 4 | 2 | 9
24989 | 2016-03-02 09:58:04.191112 | 214569 | | 334587d9 | 3 | 3 | 4 | 5 | 8 | 7 | d | 9
50100 | 2016-03-02 09:58:04.398499 | 409819 | | 33beb7b9 | 3 | 3 | b | e | b | 7 | b | 9
92623 | 2016-03-02 09:58:04.745372 | 280100 | | 3373e719 | 3 | 3 | 7 | 3 | e | 7 | 1 | 9
106054 | 2016-03-02 09:58:04.855627 | 155192 | | 33c575c9 | 3 | 3 | c | 5 | 7 | 5 | c | 9
107070 | 2016-03-02 09:58:04.863827 | 464325 | | 337dd729 | 3 | 3 | 7 | d | d | 7 | 2 | 9
135152 | 2016-03-02 09:58:05.088217 | 240500 | | 336271d9 | 3 | 3 | 6 | 2 | 7 | 1 | d | 9
156425 | 2016-03-02 09:58:05.25805 | 218202 | | 333e7289 | 3 | 3 | 3 | e | 7 | 2 | 8 | 9
170210 | 2016-03-02 09:58:05.368371 | 132530 | | 33a8d789 | 3 | 3 | a | 8 | d | 7 | 8 | 9
(10 rows)

Time: 37.739 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where c1='3' and c2='3' and c8='9' and (c4='7' or c5='7' or c6='7') limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..8135.78 rows=10 width=57) (actual time=0.017..35.532 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=1755
-> Seq Scan on public.t_split (cost=0.00..353093.00 rows=434 width=57) (actual time=0.015..35.526 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Filter: ((t_split.c1 = '3'::bpchar) AND (t_split.c2 = '3'::bpchar) AND (t_split.c8 = '9'::bpchar) AND ((t_split.c4 = '7'::bpchar) OR (t_split.c5 = '7'::bpchar) OR (t_split.c6 = '7'::bpchar)))
Rows Removed by Filter: 170200
Buffers: shared hit=1755
Planning time: 0.210 ms
Execution time: 35.572 ms
(10 rows)

Time: 36.260 ms

postgres=# select * from t_split where info ~ '^3.?[b-g]+ed[\d]+79' order by info <-> '^3.?[b-g]+ed[\d]+79' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
---------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
1308724 | 2016-03-02 09:58:14.590901 | 458822 | | 3fed9479 | 3 | f | e | d | 9 | 4 | 7 | 9
2866024 | 2016-03-02 09:58:27.20105 | 106467 | | 3fed2279 | 3 | f | e | d | 2 | 2 | 7 | 9
4826729 | 2016-03-02 09:58:42.907431 | 228023 | | 3ded9879 | 3 | d | e | d | 9 | 8 | 7 | 9
6113373 | 2016-03-02 09:58:53.211146 | 499702 | | 36fed479 | 3 | 6 | f | e | d | 4 | 7 | 9
1768237 | 2016-03-02 09:58:18.310069 | 345027 | | 30fed079 | 3 | 0 | f | e | d | 0 | 7 | 9
1472324 | 2016-03-02 09:58:15.913629 | 413283 | | 3eed5798 | 3 | e | e | d | 5 | 7 | 9 | 8
8319056 | 2016-03-02 09:59:10.902137 | 336740 | | 3ded7790 | 3 | d | e | d | 7 | 7 | 9 | 0
8576573 | 2016-03-02 09:59:12.962923 | 130223 | | 3eed5793 | 3 | e | e | d | 5 | 7 | 9 | 3
(9 rows)

Time: 268.661 ms

postgres=# explain (analyze,verbose,timing,buffers,costs) select * from t_split where info ~ '^3.?[b-g]+ed[\d]+79' order by info <-> '^3.?[b-g]+ed[\d]+79' limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4302.66..4302.69 rows=10 width=57) (actual time=269.214..269.217 rows=9 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '^3.?[b-g]+ed[\d]+79'::text))
Buffers: shared hit=52606
-> Sort (cost=4302.66..4305.16 rows=1000 width=57) (actual time=269.212..269.212 rows=9 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '^3.?[b-g]+ed[\d]+79'::text))
Sort Key: ((t_split.info <-> '^3.?[b-g]+ed[\d]+79'::text))
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=52606
-> Bitmap Heap Scan on public.t_split (cost=575.75..4281.06 rows=1000 width=57) (actual time=100.771..269.180 rows=9 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, (info <-> '^3.?[b-g]+ed[\d]+79'::text)
Recheck Cond: (t_split.info ~ '^3.?[b-g]+ed[\d]+79'::text)
Rows Removed by Index Recheck: 72929
Heap Blocks: exact=52479
Buffers: shared hit=52606
-> Bitmap Index Scan on idx9 (cost=0.00..575.50 rows=1000 width=0) (actual time=88.062..88.062 rows=72938 loops=1)
Index Cond: (t_split.info ~ '^3.?[b-g]+ed[\d]+79'::text)
Buffers: shared hit=127
Planning time: 0.640 ms
Execution time: 269.287 ms
(19 rows)

Time: 270.430 ms

postgres=# select * from t_split where info ~ '3.?[b-g]+ed[\d]+79' order by info <-> '3.?[b-g]+ed[\d]+79' limit 10;
id | crt_time | sensorid | sensorloc | info | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8
---------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----
1 | 2016-03-02 09:58:03.990639 | 161958 | | 33eed779 | 3 | 3 | e | e | d | 7 | 7 | 9
1308724 | 2016-03-02 09:58:14.590901 | 458822 | | 3fed9479 | 3 | f | e | d | 9 | 4 | 7 | 9
4826729 | 2016-03-02 09:58:42.907431 | 228023 | | 3ded9879 | 3 | d | e | d | 9 | 8 | 7 | 9
5250603 | 2016-03-02 09:58:46.300289 | 250582 | | d3eed179 | d | 3 | e | e | d | 1 | 7 | 9
6113373 | 2016-03-02 09:58:53.211146 | 499702 | | 36fed479 | 3 | 6 | f | e | d | 4 | 7 | 9
1768237 | 2016-03-02 09:58:18.310069 | 345027 | | 30fed079 | 3 | 0 | f | e | d | 0 | 7 | 9
2866024 | 2016-03-02 09:58:27.20105 | 106467 | | 3fed2279 | 3 | f | e | d | 2 | 2 | 7 | 9
1472324 | 2016-03-02 09:58:15.913629 | 413283 | | 3eed5798 | 3 | e | e | d | 5 | 7 | 9 | 8
8576573 | 2016-03-02 09:59:12.962923 | 130223 | | 3eed5793 | 3 | e | e | d | 5 | 7 | 9 | 3
8319056 | 2016-03-02 09:59:10.902137 | 336740 | | 3ded7790 | 3 | d | e | d | 7 | 7 | 9 | 0
(10 rows)

Time: 320.525 ms

postgres=# explain (analyze,verbose,buffers,costs,timing) select * from t_split where info ~ '3.?[b-g]+ed[\d]+79' order by info <-> '3.?[b-g]+ed[\d]+79' limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4302.66..4302.69 rows=10 width=57) (actual time=319.925..319.927 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '3.?[b-g]+ed[\d]+79'::text))
Buffers: shared hit=52602
-> Sort (cost=4302.66..4305.16 rows=1000 width=57) (actual time=319.923..319.923 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '3.?[b-g]+ed[\d]+79'::text))
Sort Key: ((t_split.info <-> '3.?[b-g]+ed[\d]+79'::text))
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=52602
-> Bitmap Heap Scan on public.t_split (cost=575.75..4281.06 rows=1000 width=57) (actual time=104.526..319.885 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, (info <-> '3.?[b-g]+ed[\d]+79'::text)
Recheck Cond: (t_split.info ~ '3.?[b-g]+ed[\d]+79'::text)
Rows Removed by Index Recheck: 72928
Heap Blocks: exact=52479
Buffers: shared hit=52602
-> Bitmap Index Scan on idx9 (cost=0.00..575.50 rows=1000 width=0) (actual time=91.945..91.945 rows=72938 loops=1)
Index Cond: (t_split.info ~ '3.?[b-g]+ed[\d]+79'::text)
Buffers: shared hit=123
Planning time: 0.948 ms
Execution time: 320.003 ms
(19 rows)

Time: 321.502 ms

Performance Test For Large Data Volume

Simulate a partition table, with one partition per hour. The data volume is 50 million per hour, 1.2 billion per day, and 36 billion per month.

drop table tbl cascade;    
create table tbl (id serial8, crt_time timestamp, sensorid int, sensorloc point, info text);

do language plpgsql $$
declare
v_s timestamp := '2016-01-01 00:00:00';
begin
for i in 1..720 loop
execute 'create table tbl_'||to_char(v_s,'yyyymmddhh24')||' (id int8 not null default nextval(''tbl_id_seq''::regclass), crt_time timestamp check (crt_time >= '''||to_char(v_s,'yyyy-mm-dd hh24:mi:ss')||''' and crt_time <'''||to_char(v_s+'1 h'::interval,'yyyy-mm-dd hh24:mi:ss')||'''), sensorid int, sensorloc point, info text) inherits (tbl)';
v_s := v_s + '1 h'::interval;
end loop;
end;
$$;

alter sequence tbl_id_seq cache 100000;

Generate INSERT SQL.

do language plpgsql $$    
declare
v_s timestamp := '2016-01-01 00:00:00';
begin
for i in 1..720 loop
raise notice '%', 'psql -c "insert into tbl_'||to_char(v_s,'yyyymmddhh24')||' (crt_time, sensorid, info) select '''||to_char(v_s,'yyyy-mm-dd hh24:mi:ss')||''',trunc(random()*500000), substring(md5(random()::text),1,8) from generate_series(1,50000000);" &';
v_s := v_s + '1 h'::interval;
end loop;
end;
$$;

Performance metric: For 50 million data records in a single table, the range scan returns the results in milliseconds.

postgres=# explain (analyze,verbose,timing,buffers,costs) select * from tbl where crt_time between '2016-01-01 12:00:00' and '2016-01-01 12:30:00' and info ~ 'f[\d]{2}e27e0$' order by info <-> 'f[\d]{2}e27e0$' limit 10;    
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18918.83..18918.85 rows=10 width=45) (actual time=350.296..350.297 rows=2 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info, ((tbl.info <-> 'f[\d]{2}e27e0$'::text))
Buffers: shared hit=4530
-> Sort (cost=18918.83..18931.33 rows=5001 width=45) (actual time=350.294..350.295 rows=2 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info, ((tbl.info <-> 'f[\d]{2}e27e0$'::text))
Sort Key: ((tbl.info <-> 'f[\d]{2}e27e0$'::text))
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=4530
-> Result (cost=0.00..18810.76 rows=5001 width=45) (actual time=347.995..350.279 rows=2 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info, (tbl.info <-> 'f[\d]{2}e27e0$'::text)
Buffers: shared hit=4530
-> Append (cost=0.00..18798.26 rows=5001 width=45) (actual time=347.976..350.254 rows=2 loops=1)
Buffers: shared hit=4530
-> Seq Scan on public.tbl (cost=0.00..0.00 rows=1 width=68) (actual time=0.001..0.001 rows=0 loops=1)
Output: tbl.id, tbl.crt_time, tbl.sensorid, tbl.sensorloc, tbl.info
Filter: ((tbl.crt_time >= '2016-01-01 12:00:00'::timestamp without time zone) AND (tbl.crt_time <= '2016-01-01 12:30:00'::timestamp without time zone) AND (tbl.info ~ 'f[\d]{2}e27e0$'::text))
-> Bitmap Heap Scan on public.tbl_2016010112 (cost=574.75..18798.26 rows=5000 width=45) (actual time=347.972..350.249 rows=2 loops=1)
Output: tbl_2016010112.id, tbl_2016010112.crt_time, tbl_2016010112.sensorid, tbl_2016010112.sensorloc, tbl_2016010112.info
Recheck Cond: (tbl_2016010112.info ~ 'f[\d]{2}e27e0$'::text)
Rows Removed by Index Recheck: 4100
Filter: ((tbl_2016010112.crt_time >= '2016-01-01 12:00:00'::timestamp without time zone) AND (tbl_2016010112.crt_time <= '2016-01-01 12:30:00'::timestamp without time zone))
Heap Blocks: exact=4085
Buffers: shared hit=4530
-> Bitmap Index Scan on idx_tbl_2016010112 (cost=0.00..573.50 rows=5000 width=0) (actual time=337.125..337.125 rows=4102 loops=1)
Index Cond: (tbl_2016010112.info ~ 'f[\d]{2}e27e0$'::text)
Buffers: shared hit=445
Planning time: 23.913 ms
Execution time: 350.383 ms
(28 rows)

postgres=# select * from tbl where crt_time between '2016-01-01 12:00:00' and '2016-01-01 12:30:00' and info ~ 'f[\d]{2}e27e0$' order by info <-> 'f[\d]{2}e27e0$' limit 10;
id | crt_time | sensorid | sensorloc | info
------------+---------------------+----------+-----------+----------
1982100172 | 2016-01-01 12:00:00 | 336772 | | f48e27e0
2292713691 | 2016-01-01 12:00:00 | 489110 | | f77e27e0
(2 rows)

Performance Test of Regex search and fuzzy search in a Single Table with 14.4 Billion Data Records

postgres=# \dt+ t_all    
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------+-------+----------+--------+-------------
public | t_all | table | postgres | 811 GB |
(1 row)

postgres=# \d t_all
Table "public.t_all"
Column | Type | Modifiers
-----------+-----------------------------+-----------
id | bigint | not null
crt_time | timestamp without time zone |
sensorid | integer |
sensorloc | point |
info | text |

postgres=# select count(*) from t_all;
count
-------------
14456717312
(1 row)

postgres=# select * from t_all limit 10;
id | crt_time | sensorid | sensorloc | info
------------+---------------------+----------+-----------+----------
6519272065 | 2016-01-06 10:00:00 | 493013 | | 62255c83
6519272066 | 2016-01-06 10:00:00 | 309676 | | f6c98800
6519272067 | 2016-01-06 10:00:00 | 43859 | | 125a1191
6519272068 | 2016-01-06 10:00:00 | 495624 | | e75cfd71
6519272069 | 2016-01-06 10:00:00 | 10362 | | 7171f11c
6519272070 | 2016-01-06 10:00:00 | 231476 | | 4201f809
6519272071 | 2016-01-06 10:00:00 | 43080 | | a47e84e5
6519272072 | 2016-01-06 10:00:00 | 131292 | | 17bc248e
6519272073 | 2016-01-06 10:00:00 | 486841 | | 3303097c
6519272074 | 2016-01-06 10:00:00 | 491503 | | f0c53fee
(10 rows)

The test data is subsequently released, and it is observed that there is no problem in returning the results in seconds after sharding the table.

There is high efficiency because the bottleneck is not on IO, but mainly on the rechecking the data. If the 14.4 billion data records are split into 29 tables (each table contains about 500 million data records) and parallelly executed, it is possible to obtain the results in seconds.

Let’s take a look at the test results for 500 million data records in a single table. The results are returned in seconds.

postgres=# explain (verbose,analyze,buffers,timing,costs) select * from tbl1 where info ~ 'aad.+f02' limit 10;  
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1439.79..1476.19 rows=10 width=29) (actual time=116.570..116.719 rows=10 loops=1)
Output: id, crt_time, sensorid, info
Buffers: shared hit=680
-> Bitmap Heap Scan on public.tbl1 (cost=1439.79..191054.88 rows=52103 width=29) (actual time=116.568..116.716 rows=10 loops=1)
Output: id, crt_time, sensorid, info
Recheck Cond: (tbl1.info ~ 'aad.+f02'::text)
Rows Removed by Index Recheck: 38
Heap Blocks: exact=48
Buffers: shared hit=680
-> Bitmap Index Scan on tbl1_info_idx (cost=0.00..1426.77 rows=52103 width=0) (actual time=116.495..116.495 rows=403 loops=1)
Index Cond: (tbl1.info ~ 'aad.+f02'::text)
Buffers: shared hit=632
Planning time: 0.311 ms
Execution time: 116.754 ms
(14 rows)

Time: 117.422 ms
postgres=# select * from tbl1 where info ~ 'aad.+f02' limit 10;
id | crt_time | sensorid | info
-----------+----------------------------+----------+----------
17986922 | 2016-02-29 17:42:42.427639 | 75863 | aad3f02a
19873247 | 2016-02-29 17:43:16.714945 | 174971 | 2aad5f02
23798336 | 2016-02-29 17:44:35.369654 | 202085 | aad06f02
28630866 | 2016-02-29 17:46:03.544462 | 463184 | baad3f02
31458529 | 2016-02-29 17:47:00.300937 | 411670 | aad1af02
52009687 | 2016-02-29 17:53:15.466246 | 192821 | 5aad6f02
80769909 | 2016-02-29 18:01:31.074248 | 47993 | aadcf029
80825896 | 2016-02-29 18:01:31.039063 | 284712 | aad14f02
83385996 | 2016-02-29 18:02:12.699317 | 78233 | daadcf02
102814316 | 2016-02-29 18:08:20.891412 | 359635 | aad06f02
(10 rows)

Time: 116.901 ms

The following shows the time required for the full-table scan.

postgres=# set enable_bitmapscan=off;  
SET
Time: 0.145 ms
postgres=# select * from tbl1 where info ~ 'aad.+f02' limit 10;
id | crt_time | sensorid | info
-----------+----------------------------+----------+----------
31458529 | 2016-02-29 17:47:00.300937 | 411670 | aad1af02
52009687 | 2016-02-29 17:53:15.466246 | 192821 | 5aad6f02
80769909 | 2016-02-29 18:01:31.074248 | 47993 | aadcf029
80825896 | 2016-02-29 18:01:31.039063 | 284712 | aad14f02
83385996 | 2016-02-29 18:02:12.699317 | 78233 | daadcf02
102814316 | 2016-02-29 18:08:20.891412 | 359635 | aad06f02
105236847 | 2016-02-29 18:09:56.914795 | 876 | aadbf026
108524272 | 2016-02-29 18:10:47.39312 | 338071 | 2aad2f02
128169786 | 2016-02-29 18:17:52.105948 | 262400 | aad0f028
135935810 | 2016-02-29 18:20:43.265139 | 487673 | aad7f021
(10 rows)
Time: 98903.073 ms

Performance Comparison Chart

The following figure shows the comparison in the case of 10 million data records.

The following figure shows the comparison in the case of 500 million data records.

The following figure shows the comparison between B-tree “bit or|and” and GIN in the case of 10 million pieces of data.

The following figure shows the comparison in the case of 14.4 billion data records of the partition table.

Optimization methods under large data volume: For example, how to achieve the second-level fuzzy query response for data volume above 10 billion records.

For a single machine, you can use partitions and parallel searches to make full use of the CPU function.

Alternatively, you can use the MPP and SHARDING architecture to utilize the resources of multiple machines.

Principle: To reduce the number of rechecks, try to scan and find the final desired result (a large number of scans and a large number of Remove Checked False Rows exist for full-table scan and indexes).

Additional Information

1. A brief description of the test host configuration

Common x86, 24-core, 1 xxg memory, and SSD.

2. Reference question from a friend

Digoal: Why do you use the first 8 characters of generated MD5 data? The minimum number of MD5 characters is 16. What is the mathematical reason behind this?

Reply: The result of MD5 computation is a 128-bit hex, and PostgreSQL outputs the result in the form of text, so what we see is a 32-bit string.

rfc1321  

3.1 Step 1. Append Padding Bits
3.2 Step 2. Append Length
3.3 Step 3. Initialize MD Buffer
3.4 Step 4. Process Message in 16-Word Blocks
3.5 Step 5. Output

4. Summary
The MD5 message-digest algorithm is simple to implement, and provides
a "fingerprint" or message digest of a message of arbitrary length.
It is conjectured that the difficulty of coming up with two messages
having the same message digest is on the order of 2^64 operations,
and that the difficulty of coming up with any message having a given
message digest is on the order of 2^128 operations. The MD5 algorithm
has been carefully scrutinized for weaknesses. It is, however, a
relatively new algorithm and further security analysis is of course
justified, as is the case with any new proposal of this sort.

PG code.

/*  
* Create an md5 hash of a text string and return it as hex
*
* md5 produces a 16 byte (128 bit) hash; double it for hex
*/
#define MD5_HASH_LEN 32

Datum
md5_text(PG_FUNCTION_ARGS)
{
text *in_text = PG_GETARG_TEXT_PP(0);
size_t len;
char hexsum[MD5_HASH_LEN + 1];

/* Calculate the length of the buffer using varlena metadata */
len = VARSIZE_ANY_EXHDR(in_text);

/* get the hash result */
if (pg_md5_hash(VARDATA_ANY(in_text), len, hexsum) == false)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));

/* convert to text and return it */
PG_RETURN_TEXT_P(cstring_to_text(hexsum));
}

/*
* pg_md5_hash
*
* Calculates the MD5 sum of the bytes in a buffer.
*
* SYNOPSIS #include "md5.h"
* int pg_md5_hash(const void *buff, size_t len, char *hexsum)
*
* INPUT buff the buffer containing the bytes that you want
* the MD5 sum of.
* len number of bytes in the buffer.
*
* OUTPUT hexsum the MD5 sum as a '\0'-terminated string of
* hexadecimal digits. an MD5 sum is 16 bytes long.
* each byte is represented by two heaxadecimal
* characters. you thus need to provide an array
* of 33 characters, including the trailing '\0'.
*
* RETURNS false on failure (out of memory for internal buffers) or
* true on success.
*
* STANDARDS MD5 is described in RFC 1321.
*
* AUTHOR Sverre H. Huseby <sverrehu@online.no>
*
*/
bool
pg_md5_hash(const void *buff, size_t len, char *hexsum)
{
uint8 sum[16];

if (!calculateDigestFromBuffer(buff, len, sum))
return false;

bytesToHex(sum, hexsum);
return true;
}

static int
calculateDigestFromBuffer(const uint8 *b, uint32 len, uint8 sum[16])
{
register uint32 i,
j,
k,
newI;
uint32 l;
uint8 *input;
register uint32 *wbp;
uint32 workBuff[16],
state[4];

l = len;

state[0] = 0x67452301;
state[1] = 0xEFCDAB89;
state[2] = 0x98BADCFE;
state[3] = 0x10325476;

if ((input = createPaddedCopyWithLength(b, &l)) == NULL)
return 0;

for (i = 0;;)
{
if ((newI = i + 16 * 4) > l)
break;
k = i + 3;
for (j = 0; j < 16; j++)
{
wbp = (workBuff + j);
*wbp = input[k--];
*wbp <<= 8;
*wbp |= input[k--];
*wbp <<= 8;
*wbp |= input[k--];
*wbp <<= 8;
*wbp |= input[k];
k += 7;
}
doTheRounds(workBuff, state);
i = newI;
}
free(input);

j = 0;
for (i = 0; i < 4; i++)
{
k = state[i];
sum[j++] = (k & 0xff);
k >>= 8;
sum[j++] = (k & 0xff);
k >>= 8;
sum[j++] = (k & 0xff);
k >>= 8;
sum[j++] = (k & 0xff);
}
return 1;
}


postgres=# select length(md5(random()::text)),md5(random()::text) from generate_series(1,10);
length | md5
--------+----------------------------------
32 | 4b0fcafd0f9d189cceade2812e4aa396
32 | 9169f49f0c059a65de3325a20068eb8e
32 | 7df4b60638972f1372fde91e3d2eee50
32 | 4fc6816aa88224163dda3e242f3e16f0
32 | 518870a2c8a9cfe6cba018916f5388a9
32 | b46b63c409532e5c973ddc27a5e49ce4
32 | e24e9d63926094046aa2b300dfa8e986
32 | 3f1ca7c5ce5159a0df17729f81e24925
32 | 5756f0925eca4c3801c4a49cf9b68023
32 | ed54795fcbe9491f4e5a00ec2cf323ee
(10 rows)

In addition, the purpose of intercepting the first 8 characters is to generate an 8-bit random string, which is a compromise in the length of data types, such as license plates, QQ numbers, and bar codes.

In fact, you can test a 16-bit or 32-bit random string. The test data for the corresponding length will be updated later.

Test for 32-bit Strings

Below is the performance chart for the regex search and fuzzy search in case of 32-bit strings of 1 billion records on a single machine.

Insert 1 billion data records, and store 32-bit long random strings for each record.

postgres=# create table t_regexp (info text);    
CREATE TABLE
postgres=# insert into t_regexp select md5(random()::text) from generate_series(1,1000000000);
INSERT 0 1000000000
postgres=# create index idx_t_regexp_gin on t_regexp using gin (info gin_trgm_ops);
CREATE INDEX
postgres=# create index idx_t_regexp_1 on t_regexp (info);
CREATE INDEX
postgres=# create index idx_t_regexp_2 on t_regexp (reverse(info));
CREATE INDEX

Use auto_explain to collect performance test data.

load 'auto_explain';      
set auto_explain.log_analyze =true;
set auto_explain.log_buffers =true;
set auto_explain.log_nested_statements=true;
set auto_explain.log_timing=true;
set auto_explain.log_triggers=true;
set auto_explain.log_verbose=true;
set auto_explain.log_min_duration=0;
set client_min_messages ='log';
set work_mem='2GB';

Test Data Overview: The total number of records is 1 billion, and the unique number of records is about 800 million.

digoal=> select count(*) from t_regexp ;    
count
------------
1000000000
(1 row)
Time: 18547.746 ms

digoal=> select count(*) from (select info from t_regexp group by info) t;
count
-----------
799444838
(1 row)

Find below the prefix fuzzy search performance with results.

digoal=# select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7e' limit 10;    
LOG: duration: 0.351 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7e' limit 10;
Limit (cost=0.15..8.17 rows=1 width=39) (actual time=0.152..0.346 rows=2 loops=1)
Output: ctid, info
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=0.15..8.17 rows=1 width=39) (actual time=0.150..0.342 rows=2 loops=1)
Output: ctid, info
ctid | info
--------------+----------------------------------
(14720,1) | 9def5fe6343a5938b23af38444b7e350
(379127,105) | 9def5fe6343a5938b23af38444b7e350
(2 rows)
Time: 7.952 ms

Find below the prefix fuzzy search performance without results.

digoal=# select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7a' limit 10;    
LOG: duration: 0.447 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '^9def5fe6343a5938b23af38444b7a' limit 10;
Limit (cost=0.15..8.17 rows=1 width=39) (actual time=0.443..0.443 rows=0 loops=1)
Output: ctid, info
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=0.15..8.17 rows=1 width=39) (actual time=0.441..0.441 rows=0 loops=1)
Output: ctid, info
ctid | info
------+------
(0 rows)
Time: 7.968 ms

Find below the suffix fuzzy search performance with results.

digoal=# select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495';    
LOG: duration: 5.287 ms plan:
Query Text: select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495';
Remote Fast Query Execution (cost=0.00..0.00 rows=0 width=0) (actual time=4.943..5.281 rows=1 loops=1)
Output: t_regexp.ctid, t_regexp.info
Node/s: data1, data10, data11, data12, data13, data14, data15, data16, data2, data3, data4, data5, data6, data7, data8, data9
Remote query: SELECT ctid, info FROM t_regexp WHERE (reverse(info) ~ '^1e346e5efc7703f11495'::text)
ctid | info
------------+----------------------------------
(19928,46) | cc302768b6a459411f3077cfe5e643e1
(1 row)
Time: 6.079 ms

Find below the suffix fuzzy search performance without results.

digoal=# select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495123';    
LOG: duration: 4.157 ms plan:
Query Text: select ctid,* from t_regexp where reverse(info) ~ '^1e346e5efc7703f11495123';
Remote Fast Query Execution (cost=0.00..0.00 rows=0 width=0) (actual time=4.154..4.154 rows=0 loops=1)
Output: t_regexp.ctid, t_regexp.info
Node/s: data1, data10, data11, data12, data13, data14, data15, data16, data2, data3, data4, data5, data6, data7, data8, data9
Remote query: SELECT ctid, info FROM t_regexp WHERE (reverse(info) ~ '^1e346e5efc7703f11495123'::text)
ctid | info
------+------
(0 rows)
Time: 4.930 ms

Find below the fully fuzzy search performance with results.

digoal=# select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c949' order by info <-> '6ccd8ca827b0526cd57a71c949' limit 10;    
LOG: duration: 3066.193 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c949' order by info <-> '6ccd8ca827b0526cd57a71c949' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3066.175..3066.176 rows=1 loops=1)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c949'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3066.173..3066.174 rows=1 loops=1)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c949'::text)
ctid | info
--------------+----------------------------------
(459019,106) | 8536ccd8ca827b0526cd57a71c949ab2
(1 row)
Time: 3074.907 ms

Find below the fully fuzzy search performance without results.

digoal=# select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;    
LOG: duration: 3099.348 ms plan:
Query Text: select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3099.341..3099.341 rows=0 loops=1)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3099.339..3099.339 rows=0 loops=1)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c123'::text)
ctid | info
------+------
(0 rows)
Time: 3108.698 ms

The GIN index is used.

digoal=# explain (verbose) select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;    
LOG: duration: 0.000 ms plan:
Query Text: explain (verbose) select ctid,* from t_regexp where info ~ '6ccd8ca827b0526cd57a71c123' order by info <-> '6ccd8ca827b0526cd57a71c123' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c123'::text)
-> Limit (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Sort (cost=72.03..72.03 rows=1 width=39)
Output: ctid, info, ((info <-> '6ccd8ca827b0526cd57a71c123'::text))
Sort Key: ((t_regexp.info <-> '6ccd8ca827b0526cd57a71c123'::text))
-> Bitmap Heap Scan on public.t_regexp (cost=68.00..72.02 rows=1 width=39)
Output: ctid, info, (info <-> '6ccd8ca827b0526cd57a71c123'::text)
Recheck Cond: (t_regexp.info ~ '6ccd8ca827b0526cd57a71c123'::text)
-> Bitmap Index Scan on idx_t_regexp_gin (cost=0.00..68.00 rows=1 width=0)
Index Cond: (t_regexp.info ~ '6ccd8ca827b0526cd57a71c123'::text)

Find below the regex match performance with results (for example, 0 is confused with 8 and B, and 6 is confused with b).

digoal=# select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' limit 10;    
LOG: duration: 3181.592 ms plan:
Query Text: select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3181.568..3181.569 rows=2 loops=1)
Output: ctid, info, ((info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3181.566..3181.566 rows=2 loops=1)
Output: ctid, info, (info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]efb'::text)
ctid | info
--------------+----------------------------------
(260416,110) | b44c9616bfa86a722daa9596f86ebefb
(504007,4) | b44c9616bfa86a722daa9596f86ebefb
(2 rows)
Time: 3189.666 ms

Consider another example, if you forget a few characters in the middle, and only know that they might be 2 to 5 numbers.

digoal=# select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa[\d]{2,5}6f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb' limit 10;    
LOG: duration: 3249.156 ms plan:
Query Text: select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa[\d]{2,5}6f86e[b|6]efb' order by info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3249.136..3249.137 rows=2 loops=1)
Output: ctid, info, ((info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3249.134..3249.134 rows=2 loops=1)
Output: ctid, info, (info <-> 'b44c9616bfa[8|0|B]6a722daa[0-9]{2,5}6f86e[b|6]efb'::text)
ctid | info
--------------+----------------------------------
(260416,110) | b44c9616bfa86a722daa9596f86ebefb
(504007,4) | b44c9616bfa86a722daa9596f86ebefb
(2 rows)
Time: 3257.435 ms

Find below the performance without results.

digoal=# select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' limit 10;    
LOG: duration: 3086.572 ms plan:
Query Text: select ctid,* from t_regexp where info ~ 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' order by info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12' limit 10;
Limit (cost=72.03..72.03 rows=1 width=39) (actual time=3086.567..3086.567 rows=0 loops=1)
Output: ctid, info, ((info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12'::text))
-> Remote Subquery Scan on all (data1,data10,data11,data12,data13,data14,data15,data16,data2,data3,data4,data5,data6,data7,data8,data9) (cost=72.03..72.03 rows=1 width=39) (actual time=3086.565..3086.565 rows=0 loops=1)
Output: ctid, info, (info <-> 'b44c9616bfa[8|0|B]6a722daa9596f86e[b|6]e12'::text)
ctid | info
------+------
(0 rows)
Time: 3094.683 ms

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store