How to Use PostgreSQL to Efficiently Search With Split Fields?


By Digoal.

In some application scenarios, multi-field matching may be required. For example, consider a scenario in which a table includes several fields such as artist, track, album, composer, and lyrics.

In this kind of scenario, users may want to use the word-breaking function to search for a singer or performer, such as the Hong Kong singer Andy Lau or the Canadian artist Justin Bieber, and have the query return TRUE if any field matches. The traditional practice is to build a word-breaking index on each field and match against them one by one. However, this leads to lengthy SQL statements with lots of OR conditions. Can a better method be used in this scenario?

Well, actually, the answer’s yes. We can convert the whole record into one string and create a word-breaking index on that string. The questions then are what format this string takes and whether the format affects the word-breaking result.

Output Record Format in PostgreSQL

Now, create a table and insert a value.
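As a minimal sketch, assume a hypothetical table named music with the fields mentioned above:

    CREATE TABLE music (
      artist   text,
      track    text,
      album    text,
      composer text,
      lyrics   text
    );

    INSERT INTO music
    VALUES ('Andy Lau', 'Forget Love Potion', 'Memories', 'Andy Lau', 'some lyrics here');

    -- Cast the whole record to text to see the output format
    SELECT music::text FROM music;
    -- ("Andy Lau","Forget Love Potion",Memories,"Andy Lau","some lyrics here")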

Now consider the source code of the record output function in src/backend/utils/adt/rowtypes.c.
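In short, record_out joins the column values with commas, wraps the whole record in parentheses, and puts double quotes around any value that is empty or contains commas, parentheses, double quotes, backslashes, or whitespace (embedded double quotes are doubled). A quick way to see this without reading the C code:

    SELECT ROW('a,b', 'say "hi"', 'plain')::text;
    -- ("a,b","say ""hi""",plain)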

Problems with SCWS Word-Breaking

At first glance, it may seem that no problems should occur: only commas and double quotes are added, these are ordinary characters, and scws should be able to process them.

In reality, however, problems do exist in this scenario. Consider the following example: two strings that differ only in their endings. Once an additional comma is appended, the word-breaking result changes.
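As a rough sketch of the test (the strings and the parser name scws below are placeholders, not the article's actual example):

    -- Word-break the ending of the record without and with the trailing comma
    SELECT * FROM ts_parse('scws', '...some Chinese text...');
    SELECT * FROM ts_parse('scws', '...some Chinese text...,');
    -- With scws, the trailing comma can change how the final word is segmented.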

Problem Analysis

The word-breaking process in PostgreSQL consists of the following steps.

First, the parser splits the string into multiple tokens and assigns a type to each token.

Therefore, when creating a text search configuration, you need to specify a parser; the parser is the core of the word-breaking process.
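For reference, the parser is bound to a configuration at creation time, roughly like this, where scws stands in for whatever parser name your scws-based extension registers:

    -- Build a configuration on top of the Chinese word-breaking parser
    CREATE TEXT SEARCH CONFIGURATION scwscfg (PARSER = scws);

    -- Or derive a configuration from an existing one
    CREATE TEXT SEARCH CONFIGURATION mycfg (COPY = english);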

At the same time, token types supported by the parser must also be specified when you create the parser.
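The parser itself is defined with CREATE TEXT SEARCH PARSER, whose LEXTYPES support function is what reports the token types; a schematic example with hypothetical support-function names:

    CREATE TEXT SEARCH PARSER scws (
      START    = scws_start,      -- hypothetical C support functions
      GETTOKEN = scws_gettoken,
      END      = scws_end,
      LEXTYPES = scws_lextypes    -- declares the supported token types
    );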

View the parsers that have been created:
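For example:

    -- In psql
    \dFp+

    -- Or query the catalog directly
    SELECT prsname, prsstart, prstoken, prsend, prslextype FROM pg_ts_parser;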

View the token types that a parser supports:
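For the built-in default parser:

    SELECT * FROM ts_token_type('default');
    -- or, in psql:
    \dFp+ default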

Each token type is mapped to one or more dictionaries for matching.

View the mapping between a configuration’s token types and its dictionaries:
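For example:

    -- Token type → dictionary mappings for every configuration
    SELECT cfg.cfgname, m.maptokentype, d.dictname, m.mapseqno
    FROM   pg_ts_config_map m
    JOIN   pg_ts_config cfg ON cfg.oid = m.mapcfg
    JOIN   pg_ts_dict   d   ON d.oid  = m.mapdict
    ORDER  BY cfg.cfgname, m.maptokentype, m.mapseqno;

    -- Or, more readably in psql:
    \dF+ english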

Next, each token is checked against the dictionaries mapped to its type; the first dictionary that recognizes the token converts it into one or more lexemes.
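A single dictionary can be tested directly with ts_lexize; for instance, with the built-in english_stem dictionary:

    SELECT ts_lexize('english_stem', 'singing');   -- {sing}
    SELECT ts_lexize('english_stem', 'the');       -- {} : stop word, no lexeme produced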

The following functions can be used to debug word-breaking problems.

  • ts_token_type(parser_name text, OUT tokid integer, OUT alias text, OUT description text): Returns the token types supported by the given parser.
  • ts_parse(parser_name text, txt text, OUT tokid integer, OUT token text): Parses the given string with the specified parser and returns each token together with its type ID.
  • ts_debug(config regconfig, document text, OUT alias text, OUT description text, OUT token text, OUT dictionaries regdictionary[], OUT dictionary regdictionary, OUT lexemes text[]): Uses the specified word-breaking configuration to show each token, its type, the candidate dictionaries, the dictionary that matched, and the resulting lexemes.
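For instance, with the built-in default parser and english configuration:

    -- How a string is split into tokens
    SELECT * FROM ts_parse('default', 'Andy Lau, Hong Kong');

    -- Tokens, candidate dictionaries, and resulting lexemes in one view
    SELECT alias, token, dictionaries, lexemes
    FROM   ts_debug('english', 'Andy Lau, Hong Kong');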

In the preceding case, when the scws parser is used, appending the comma changes the token output.

Solution

Without modifying the scws code, we can simply replace the commas with spaces before word-breaking, because scws ignores spaces.
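With the hypothetical music table from earlier and scwscfg standing in for an scws-based configuration, the search expression then becomes something like:

    SELECT to_tsvector('scwscfg', replace(music::text, ',', ' '))
    FROM   music;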

Full-text Search Index
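Below is a minimal sketch of a GIN expression index over the comma-replaced record text, assuming the hypothetical music table and, for simplicity, the built-in english configuration (substitute your scws configuration). The IMMUTABLE wrapper function is an assumption: the whole-row-to-text cast is not marked immutable, so it cannot appear directly in an index expression, but for a table containing only text columns the wrapper is safe.

    -- Hypothetical wrapper so the record-to-text conversion can be indexed
    CREATE OR REPLACE FUNCTION music_to_text(rec music) RETURNS text AS $$
      SELECT replace(rec::text, ',', ' ');   -- drop the commas added by the record output
    $$ LANGUAGE sql IMMUTABLE STRICT;

    -- GIN full-text index over the whole record
    CREATE INDEX idx_music_fts ON music
      USING gin (to_tsvector('english', music_to_text(music)));

    -- A match in any field returns the row
    SELECT *
    FROM   music
    WHERE  to_tsvector('english', music_to_text(music))
           @@ to_tsquery('english', 'Andy & Lau');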

References

  • For word-breaking, each CPU core can process about 44,400 words per second.

For the CPU:

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.
