How to Use PostgreSQL to Efficiently Search With Split Fields?

Image for post
Image for post

By Digoal.

In some application scenarios, multi-field matching may be required. For example, consider a scenario in which a table includes several fields like artist, track, album, composer, and Lyrics.

In this kind of scenario, users may want to use the word-breaking function to find matches for singers and performers like the Hong Kong singer Andy Lau or Canadian personality Justin Bieber, for example, and have it return TRUE if any field matches. The traditional practice is to establish a word-breaking index on each field and perform the matching one by one. However, this leads to lengthy SQL statements and requires lots of OR operations. Can a better method be used in this scenario?

Well, actually, the answer’s yes. We can convert this whole record into one string and create a word-breaking index on this string. At this point, the question is what is the format of the output, and if this will impact the result of the word-breaking.

Output Record Format in PostgreSQL

Now, create a table and inset the value.

create table t1(id int, c1 text, c2 text, c3 text);    
insert into t1 values (1 , 'speed-fast e5a1cbb8' , 'speed-fast e5a1cbb8' , 'abc');

postgres=# select t1::text from t1;
t1
-----------------------------------
(1,speed-fast e5a1cbb8,speed-fast e5a1cbb8,abc)
(1 row)

postgres=# \df+ record_out
List of functions
Schema | Name | Result data type | Argument data types | Type | Security | Volatility | Owner | Language | Source code | Description
------------+------------+------------------+---------------------+--------+----------+------------+----------+----------+-------------+-------------
pg_catalog | record_out | cstring | record | normal | invoker | stable | postgres | internal | record_out | I/O
(1 row)

Now consider the source code of the record output src/backend/utils/adt/rowtypes.c.

/*  
* record_out - output routine for any composite type.
*/
Datum
record_out(PG_FUNCTION_ARGS)
{
...
/* And build the result string */
initStringInfo(&buf);

appendStringInfoChar(&buf, '('); // Use parentheses (()) to enclose the field.

for (i = 0; i < ncolumns; i++)
{
...
if (needComma)
appendStringInfoChar(&buf, ','); // Use a comma (,) between fields.
needComma = true;
...
/* Detect whether we need double quotes for this value */
nq = (value[0] == '\0'); /* force quotes for empty string */
for (tmp = value; *tmp; tmp++)
{
char ch = *tmp;

if (ch == '"' || ch == '\\' ||
ch == '(' || ch == ')' || ch == ',' ||
isspace((unsigned char) ch))
{
nq = true;
break;
}
}

/* And emit the string */
if (nq)
appendStringInfoCharMacro(&buf, '"'); // For some types, use the double quotation markers ("")
for (tmp = value; *tmp; tmp++)
{
char ch = *tmp;

if (ch == '"' || ch == '\\')
appendStringInfoCharMacro(&buf, ch);
appendStringInfoCharMacro(&buf, ch);
}
if (nq)
appendStringInfoCharMacro(&buf, '"');
}

appendStringInfoChar(&buf, ')');
...

Problems with SCWS Word-Breaking

So it may seem that no problems should occur. Only a comma and double quotes are added. These are all characters and scws should be able to process them.

In reality, however, some problems do exist in this scenario. Consider the following example. The two words only differ in the ending parts. If an additional comma is added, the result is like this.

postgres=# select * from ts_debug('scwscfg', 'speed-fast e5a1cbb8,');  
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------+-------+--------------+------------+---------
k | head | speed | {} | |
a | adjective | fast | {simple} | simple | {fast}
e | exclamation | e5a | {simple} | simple | {e5a}
e | exclamation | 1cbb | {simple} | simple | {1cbb}
e | exclamation | 8 | {simple} | simple | {8}
u | auxiliary | , | {} | |
(6 rows)

postgres=# select * from ts_debug('scwscfg', 'speed-fast e5a1cbb8');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------+----------+--------------+------------+------------
k | head | speed | {} | |
a | adjective | fast | {simple} | simple | {fast}
e | exclamation | e5a1cbb8 | {simple} | simple | {e5a1cbb8}
(3 rows)

Problem analysis

The following diagram shows the word-breaking steps in PostgreSQL.

First, use parse to split a string into multiple tokens and specify the type of each token.

Therefore, when creating the text search configuration, you need to specify a parser, which is also the core part of the word-breaking process.

Command:     CREATE TEXT SEARCH CONFIGURATION  
Description: define a new text search configuration
Syntax:
CREATE TEXT SEARCH CONFIGURATION name (
PARSER = parser_name |
COPY = source_config
)

At the same time, token types supported by the parser must also be specified when you create the parser.

Command:     CREATE TEXT SEARCH PARSER  
Description: define a new text search parser
Syntax:
CREATE TEXT SEARCH PARSER name (
START = start_function ,
GETTOKEN = gettoken_function ,
END = end_function ,
LEXTYPES = lextypes_function
[, HEADLINE = headline_function ]
)

View parsers that have been created.

postgres=# select * from pg_ts_parser ;  
prsname | prsnamespace | prsstart | prstoken | prsend | prsheadline | prslextype
---------+--------------+--------------+------------------+------------+---------------+----------------
default | 11 | prsd_start | prsd_nexttoken | prsd_end | prsd_headline | prsd_lextype
scws | 2200 | pgscws_start | pgscws_getlexeme | pgscws_end | prsd_headline | pgscws_lextype
jieba | 2200 | jieba_start | jieba_gettoken | jieba_end | prsd_headline | jieba_lextype
(3 rows)

View the token types that a parser supports.

postgres=# select * from ts_token_type('scws');  
tokid | alias | description
-------+-------+---------------
97 | a | adjective
98 | b | difference
99 | c | conjunction
100 | d | adverb
101 | e | exclamation
102 | f | position
103 | g | word root
104 | h | head
105 | i | idiom
106 | j | abbreviation
107 | k | head
108 | l | temp
109 | m | numeral
110 | n | noun
111 | o | onomatopoeia
112 | p | prepositional
113 | q | quantity
114 | r | pronoun
115 | s | space
116 | t | time
117 | u | auxiliary
118 | v | verb
119 | w | punctuation
120 | x | unknown
121 | y | modal
122 | z | status
(26 rows)

Each token type corresponds to one or more dictionaries for matchmaking.

ALTER TEXT SEARCH CONFIGURATION name  
ADD MAPPING FOR token_type [, ... ] WITH dictionary_name [, ... ]

View the map information of configured token types and dictionaries.

postgres=# select * from pg_ts_config_map ;

Next, for the first matching dictionary, convert the token output into a lexeme.

(Will remove stop words), delete plural forms, so on.

The following functions can be used to debug word-breaking problems.

  • ts_token_type(parser_name text, OUT tokid integer, OUT alias text, OUT description text): Returns token types supported by the parser.
  • ts_parse(parser_name text, txt text, OUT tokid integer, OUT token text): Specifies a parser and returns the string as a token.
  • ts_debug(config regconfig, document text, OUT alias text, OUT description text, OUT token text, OUT dictionaries regdictionary[], OUT dictionary regdictionary, OUT lexemes text[]): Specifies the word-breaking configuration and returns the string as a token and additional information.

In the preceding case, when the scws parser is used, the token output changes.

postgres=# select * from pg_ts_parser ;  
prsname | prsnamespace | prsstart | prstoken | prsend | prsheadline | prslextype
---------+--------------+--------------+------------------+------------+---------------+----------------
default | 11 | prsd_start | prsd_nexttoken | prsd_end | prsd_headline | prsd_lextype
scws | 2200 | pgscws_start | pgscws_getlexeme | pgscws_end | prsd_headline | pgscws_lextype
jieba | 2200 | jieba_start | jieba_gettoken | jieba_end | prsd_headline | jieba_lextype
(3 rows)

postgres=# select * from ts_parse('scws', '子远e5a1cbb8,');
tokid | token
-------+-------
107 | speed
97 | fast
101 | e5a
101 | 1cbb
101 | 8
117 | ,
(6 rows)

Solution

Without modifying the scws code, we can first replace the comma with a space, because scws will ignore spaces.

postgres=# select replace(t1::text, ',', ' ') from t1;   
replace
-----------------------------------
(1 speed-fast e5a1cbb8 speed-fast e5a1cbb8 abc)
(1 row)

postgres=# select to_tsvector('scwscfg', replace(t1::text, ',', ' ')) from t1;
to_tsvector
---------------------------------------
'1':1 'abc':6 'e5a1cbb8':3,5 'fast':2,4
(1 row)

Full-text Search Index

postgres=# create or replace function rec_to_text(anyelement) returns text as 
$$

select $1::text;
$$
language sql strict immutable;
CREATE FUNCTION

postgres=# create index idx on t1 using gin (to_tsvector('scwscfg', replace(rec_to_text(t1), ',', ' ')));
CREATE INDEX

SQL Writing
postgres=# explain verbose select * from t1 where to_tsvector('scwscfg', replace(rec_to_text(t1), ',', ' ')) @@ to_tsquery('scwscfg', 'speed-fast e5a1cbb8');
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.t1 (cost=4.50..6.52 rows=1 width=100)
Output: c1, c2, c3, c4
Recheck Cond: (to_tsvector('scwscfg'::regconfig, replace(rec_to_text(t1.*), ','::text, ' '::text)) @@ '''fast'' & ''e5a1cbb8'''::tsquery)
-> Bitmap Index Scan on idx (cost=0.00..4.50 rows=1 width=0)
Index Cond: (to_tsvector('scwscfg'::regconfig, replace(rec_to_text(t1.*), ','::text, ' '::text)) @@ '''fast'' & ''e5a1cbb8'''::tsquery)
(5 rows)

References

  • For work-breaking, each CPU core can process about 44,400 words per second.
postgres=# create extension pg_scws;  
CREATE EXTENSION
Time: 6.544 ms
postgres=# alter function to_tsvector(regconfig,text) volatile;
ALTER FUNCTION
postgres=# select to_tsvector('scwscfg','How can I speed up the loading speeds of PostgreSQL?');
to_tsvector
-----------------------------------------------------------------------------------------
'postgresql':4 'How':2 'How can I':1 'word-breaking':6 'speed up':3 'loading':7 'PostgreSQL':5 'speeds':8
(1 row)
Time: 0.855 ms
postgres=# set zhparser.dict_in_memory = t;
SET
Time: 0.339 ms
postgres=# explain (buffers,timing,costs,verbose,analyze) select to_tsvector('scwscfg','How can I speed up the loading speeds of PostgreSQL?') from generate_series(1,100000);
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series (cost=0.00..260.00 rows=1000 width=0) (actual time=11.431..17971.197 rows=100000 loops=1)
Output: to_tsvector('scwscfg'::regconfig, 'How can I speed up the loading speeds of PostgreSQL?'::text)
Function Call: generate_series(1, 100000)
Buffers: temp read=172 written=171
Planning time: 0.042 ms
Execution time: 18000.344 ms
(6 rows)
Time: 18000.917 ms
postgres=# select 8*100000/18.000344;
?column?
--------------------
44443.595077960732
(1 row)

For the CPU:

Architecture:          x86_64  
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 2494.224
BogoMIPS: 4988.44
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-31

Original Source

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store