Full-text Search Index Optimization — Alibaba Cloud RDS PostgreSQL Best Practices

Background

With built-in GIN indexes, PostgreSQL supports full-text search and searching multi-value data types including arrays.

Will indexes be used in full-text searches that do not contain a certain keyword?

GIN stands for Generalized Inverted Index. A query that does not contain a keyword is actually a scanning process that skips the token above the main tree.

It will be cost-effective if the skipped token contains a relatively large amount of data. PostgreSQL uses a CBO (cost-based optimizer) execution plan optimizer to automatically choose the best index.

Example 1. Full-text search — “Does Not Contain” query

1.Create a test table

2.Create a function to generate random strings

3.Insert 1 million entries of test data

4.Create full-text index (a GIN index)

5.Query one of the records

6.Test — Does Not Contain a keyword

The database automatically chooses Full Table Scan, and does not use the GIN index.

Why wasn’t an index used? As I have explained before, the number of data records containing the keyword is quite small. It is not cost-effective to use indexes for filtering in a query that does not contain a keyword.

(However, if the query contains a keyword, it will be very cost-effective to use the GIN indexes. “Contains” and “Does Not Contain” are inversions of each other, and so are the associated costs.)

7.Force disable Full Table Scan, and allow the database to choose the indexes.

We can see that, searches that use index are really slow. Most of the time we can trust the database to find the most efficient method to complete our queries. (All of these can be calibrated as long as cost factors and environmental performance measuring are accurate enough.)

Example 2. Full-text search — “Does Not Contain” query

In this example, I am going to create unevenly distributed data. This token contains a large number of repeated content, and I will filter them out with a “Does Not Contain” operator. Let’s see if the index will be used.

1.Generate test data

2.Test search entries that do not contain ABC

The database automatically chooses indexed scanning and skips data blocks that do not need to be searched.

3.Force enable Full Table Scan. We can see that the performance is truly no match for indexed scanning, which proves that PostgreSQL is a cost-based optimizer and can automatically choose the most optimal execution plan.

Example 3. Ordinary B-tree indexes — “Not Equal To” searches

This example is an ordinary search that uses a B-tree index. Let’s see if PostgreSQL supports the “Not Equal To” indexed search.

The test method is similar to that of GIN test. In this test, I am going to use both unevenly and evenly distributed data.

1.For a query that uses the “Does Not Contain” operator on unevenly distributed data, the number of records filtered out with indexes is very small.

So far, “Does Not Contain” searches that uses the B-tree indexes are not supported at the kernel level. (Although this can be achieved by skipping Branch nodes that do not need to be scanned with INDEX SKIP SCAN.)

2.Change the SQL write method to achieve indexed search. As INDEX SKIP SCAN is not used in this case, we need a JOIN process. The resulting performance is not very good.

3.PostgreSQL uses multi-core parallel computing to allow incredible performance improvements to Full Table Scan.

Parallel scanning can remarkably improve the performance in case with a significantly large number of records.

Performance see marked improvement when we use a parallel query.

Example 4. Ordinary partial B-tree indexes — “Not Equal To” searches

For fixed “Not Equal To” queries, we can use the partial index function of PostgreSQL.

Cool! It takes only 0.03 ms to run a “Not Equal To” search over 10 million entries of data using partial index.

Summary

With built-in GIN indexes, PostgreSQL supports full-text search and searching multi-value data types including arrays.

PostgreSQL uses a cost-based execution plan optimizer. It automatically chooses the best execution plan. When running a “Does Not Contain” search, PostgreSQL automatically chooses whether to use indexed scanning.

With regards to B-tree indexes, technically it can be used to implement the “Not Equal To” searches (INDEX SKIP SCAN), but it is currently not supported at the kernel level. We can use indexed scanning by adjusting the current SQL write methods.

PostgreSQL uses multi-core parallel computing to allow incredible performance improvements to Full Table Scan. Parallel scanning can remarkably improve the performance in case with a significantly large number of records.

PostgreSQL supports partial index, which supports partitioned or partial indexes. Performance is outstanding for “Not Equal To” queries with fixed conditions.

Reference:

https://www.alibabacloud.com/blog/Full-text-Search-Index-Optimization---Alibaba-Cloud-RDS-PostgreSQL-Best-Practices_p339391?spm=a2c41.11149574.0.0

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store