Double Eleven Technology Series: Word Segmentation and Index Searching

11.11 The Biggest Deals of the Year. 40% OFF on selected cloud servers with a free 100 GB data transfer! Click here to learn more.

PostgreSQL has a good reputation in the search field. This is particularly evident in the GIS field where it has taken the leading position in the industry for many years. Fortunately for us, PostgreSQL turns out to be a great solution for e-commerce applications as well. In this article, we will introduce PostgreSQL index searching and word segmentation technologies used for Alibaba’s Double Eleven (Singles’ Day) annual online shopping festival.

Full Text Search Types

Full text search involves two data types: phrases and search word combinations.

https://www.postgresql.org/docs/9.6/static/datatype-textsearch.html

1. Phrase Type

Character strings are converted into phrases based on the selected phrasing rule. The phrases can be considered as a group of abstracted lexemes.

Example:

2. Search Word Type

The search words refer to the words you want to search, for example, postgresql, or postgresql or mysql.

You also need to select a configuration rule to judge and filter lexemes. The phrases you want to search can be combined randomly.

Moreover, tsquery supports prefix query, for example:

Tsvector also involves the section and weight concepts, for example, body, title, and sub-title. There are four levels, A, B, C, and D, indicating the layer and location where the lexeme resides.

Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default and hence is not shown on output:

Full Text Search Type Operator

We have talked about data types. Next, let’s learn about the data type operators. Like the operators in mathematics, word segmentation type has the match, include, add, and adjacent operators.

Example:

Full Text Search Type Functions

These functions are database built-in functions of the tsvector or tsquery type, supporting various commonly used functions.

Example:

Converting array into the phrase type, obtaining the current tsconfig (English, Chinese, etc.), obtaining the phrase length (number of lexemes), converting character strings into search words, and converting phrases into search words (including location information)

Image for post
Image for post

Example Use Case

The following describes the phrase conversion function added to PG 9.6, which supports adjacency.

For example, when you type in Chinese Taoist Culture, it is segmented into Chinese <-> Taoist <-> Culture. The match is positive only when the words are adjacent. Otherwise, the words cannot be matched.

The word segmentation such as “Chinese population census, Taoist proportion xx, education level xx” cannot be matched. If you want a positive match, type in Chinese & Taoist & Culture. This function is convenient.

Phrases supporting or not supporting indexing.

Add or remove the weight (namely, ABCD) of tsvector.

Remove the specified lexeme from tsvector. If you know some words are meaningless, remove them.

Filter phrases based on weight. For example, only check whether the title and sub-title are matched (assume that the weights of title and sub-title are A and B respectively).

The words in bold are positively matched. This is a useful function, especially in demonstration.

For example, some blog platforms set sensitive words, which will be filtered out. With this function, I can quickly find the words that match the sensitive word rules.

The match percentage is useful in many scenarios. For example, rank the words based on match degree. The word with the highest match percentage is ranked first.

Phrases include the title, sub-title, body, and section (namely, weight), so different coefficients can be set for different weights, for example:

That is, the match degree is calculated based on the weights you have set.

Rewrite search words, like SQL REWRITE or the text replacement function.

Calculate the phrase and convert it into tsquery, including the lexeme distance information.

Convert tsvector into array, which does not include location information.

The automatic phrase update function is important. If you have updated the text fields, how are the phrases automatically updated?

Expand tsvector to convert it into multiple records.

Full Text Search Type Debugging Functions

If you have added a dictionary, and set or modified the phrasing rule, you may need to view the phrasing result after modification.

Phrasing includes several steps, for example, splitting a character string into tokens (including location information and weight) according to the dictionary, filtering unneeded tokens based on the token properties and ts config, and returning tsvector.

Using the debugging functions can display the original splitting information.

https://www.postgresql.org/docs/9.6/static/functions-textsearch.html

Example:

Image for post
Image for post

Indexing

In addition to type and function, a database must support indexing of a data type; otherwise, the database cannot fully support this data type.

The phrases in PostgreSQL support indexes, including GIN, GiST, and SP-Gist.

The following is an example of using indexes.

Multilingual Phrasing

PostgreSQL supports customized dictionary and phrasing configuration.

Chinese phrases can be added. There are many methods of adding Chinese phrases on the Internet.

https://github.com/fxsjy/jieba

https://github.com/jaiminpan/pg_scws

For this purpose, you can consider using Alibaba Cloud RDS PostgreSQL because it contains the zhparser phrase.

Fuzzy Query

Word segmentation and fuzzy search are two functions. Word segmentation cannot provide the fuzzy match function. For example, if you type in a regular expression or condition with fuzzy prefix and suffix for search, fuzzy search can hit the text that cannot be searched by word segmentation.

Word segmentation can match only the prefixes, but do not support the fuzzy prefixes, suffixes, or regular expressions.

If you require the fuzzy search function, use the pg_trgm plugin in PostgreSQL for indexing.

Similarity-based search and fuzzy match are supported.

Advantages of PostgreSQL

As you can see from the examples above, it is really easy to apply PostgreSQL to real-life e-commerce applications. PostgreSQL’s powerful search and indexing features has helped Alibaba make the Double Eleven online shopping festival successful year after year. The problems used to be solved by search engine or external plugin can be solved by PostgreSQL today. Here are some of the advantages of using PostgreSQL for word segmentation and index searching:

  1. You do not need to transfer data from one database to another. Real-time search is implemented with a low cost.
  2. Text and phrases are stored in one database, ensuring the consistency in searching.
  3. The numbers of search times and returned results are not limited, providing controllable flexibility.
  4. The lexeme weights and weight coefficients can be customized to improve search flexibility.
  5. Phrases can be ranked and searched by similarities.
  6. Lexeme distance coefficients are supported. The search can be implemented based on absolute distance coefficients. For example, ‘Fast & Furious’ is converted into ‘Fast <1> Furious’, and the movie can be accurately matched. It is impossible to match ‘speed is fast, very furious’.
  7. Phrases can be customized. If the pre-defined phrases cannot meet your requirement, define your own phrases.
  8. The tsvector granularity is adjustable.

References

PostgreSQL has a long history. It originated from University of California, Berkeley with a BSD-LIKE license. PostgreSQL is user-friendly; many products have been developed based on PostgreSQL.

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

To learn more about PostgreSQL on Alibaba Cloud, visit www.alibabacloud.com/product/apsaradb-for-rds-postgresql

Reference:https://www.alibabacloud.com/blog/double-eleven-technology-series%3A-word-segmentation-and-index-searching_594097?spm=a2c41.12185947.0.0

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store