Double Eleven Technology Series: Replacing Word Segmentation with Regular Expression and Similarity

11.11 The Biggest Deals of the Year. 40% OFF on selected cloud servers with a free 100 GB data transfer! Click here to learn more.

If you’re a fan of police procedurals such as CSI, you’ve probably watched a scene where the police uses an image recognition software to identify a criminal based on a partial or blurry image. In the past, this process is done manually and is not very reliable. Now, we can use technologies such as PostgreSQL, which searches images in the database and finds a match based on similarity.

However, I don’t want to talk about image searching today. Instead, I want to introduce text searching. Text searching may remind you about word segmentation, but word segmentation itself cannot meet all requirements. Why not? Well, let’s look at a simple example where word segmentation fails: If a user types in “hello w0rld” instead of “hello world”, how can the database match the text?

What about using a regular expression to match text? How can this be implemented? For example, if you type in a part of the domain name such as “firefox”, how would the database determine that “www.firefoxcn.org" is the correct match?

When the data volume is small and concurrency is low, a full table scan can be performed for such a query and the CPU can afford the filtering for these queries. However, if the business needs to handle a hundred million requests or data every day, full table scan and the high CPU payload are unaffordable.

How Does PostgreSQL Support Regular Expressions?

The details are provided in the pg_trgm code annotation, including four steps:

  1. The PostgreSQL regexp library converts regular expressions into NFA format (graphic phrase).
  2. The NFA format is converted into extended graphic format (trigrams), including the disassembled query phrase and NOT phrase.
  3. The unneeded trigrams are filtered out.
  4. The phrases are packaged into the TrgmPackedGraph format, which supports GIN and GIST indexing.

Examples of Regular Expression Query and Fuzzy Query

Create a test table including a character string type field. Insert 50 million random character strings.

Create the index supporting regular expression.

I have just shown the regular expression query to my colleague who works for other types of databases, and he was shocked by the speedy query.

Fuzzy query example (Actually, regular expression query includes fuzzy match. Here is a demonstration.)

The consumed time in the preceding example is calculated in the full table scan scenario. The CPU parallel computing supported by PG is not used in this example.

Index Optimization

Here, GIN indexing is performed on tokens. When the character string is longer, more tokens are used. If the index field is updated, GIN asynchronous merging needs to be optimized to improve the update performance.

https://www.postgresql.org/docs/9.6/static/sql-createindex.html

Similarity-based Query

Similarity-based query means that the database can match the input even if the text is incorrectly typed in. You may not remember every detail when doing a character puzzle or writing memoirs. In this situation, similarity-based query is useful.

Reference:https://www.alibabacloud.com/blog/double-eleven-technology-series%3A-replacing-word-segmentation-with-regular-expression-and-similarity_594103?spm=a2c41.12191988.0.0

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store