Store Operations Optimization: Search Acceleration over PostgreSQL Arrays, JSON and Internal Tag Data

Image for post
Image for post

Background

The PostgreSQL arrays, JSON data, and other data types offer a lot of convenience to sellers for managing store operations. However, the optimization of these data types becomes increasingly complicated. For example, if a seller stores the information of some products in the database and attaches multiple tags in JSON format to each product. Then, the data structure may appear as follows.

Consider the following sample data, where:

  • Each jsonb[] contains multiple JSON arrays.
  • Each JSON array contains multiple IDs.
  • Each ID corresponds to a score.

Assume that the seller wants to search for data, where “gid=?”, the data contains a certain ID, and the ID score is within a certain range. The purpose is to find data that meets certain tag conditions in a store.

Now, write a User Defined Function (UDF) to implement the query of the specific class.

The final SQL statement is shown below.

The preceding method only uses the Group Identification (GID) index. Therefore, it requires a large amount of CPU computing which inevitably becomes a bottleneck during high concurrency.

Performance Optimization Ideas

The critical question is how to accurately perform index retrieval on data and improve the performance when a known property has a maximum of N JSON elements.

In fact, the SQL statement can be changed as follows.

Or

The SQL statement contains many OR conditions, but it does not affect the performance because the GID, ID, and score are all set to fixed values, and the ID is unique in the same record. Although multiple trees are scanned, the I/O or CPU usage does not increase upon the final retrieval considering the same record can only be retrieved from one tree.

Optimization Method

Follow the steps listed below to optimize the performance:

Step 1. Set the number of JSON elements in the prop field to a fixed value.

Step 2. Construct a composite expression index for each JSON element.

Step 3. Rewrite the SQL statement.

Step 4. Use UDF to splice dynamic SQL statements or splice dynamic SQL statements at the program end.

The following statement shows the dynamic query splicing.

The index is correctly used as shown below.

Step 5. Add a composite express index if the number of elements increases.

Note that this method involves many indexes, so it might affect the Data Manipulation Language (DML) performance. However, there is a significant improvement in query performance.

Additional Knowledge

If you do not want to use a composite index, use a single column expression index instead.

You can apply this approach to implement searches by equivalent and range. Consider the following example.

Integrate GID and prop-> ID into the index.

LPAD function is also critical. Scores may vary by the number of digits, and therefore, the query result in the TEXT range may not meet expectations. In this case, LPAD can be used to ensure that the scores have the same number of digits to achieve the same sequence effect as numerical values.

The effect after padding is as follows.

Performance Improvement Test

Implement the following steps to test the performance.

Step1. Create a table.

Step 2. Construct data by creating 10 thousand GIDs, consisting of 100 thousand items each.

Make each prop have 10 JSONBs, and set the ID range of each JSONB to 0 to 1000 and the score range to 0 to 100.

Now, write 1 billion records as shown below.

Step 3. Next, you need to perform the indexing.

Step 4. Perform the pressure test by using the original method.

Step 5. Perform the pressure test by using the optimized method.

Conclusion

1. When you use the multi-tree APPEND method described in this document, there is no wastage of CPU resources and the performance improves by N times.

Maximum data capacityCaseTPSRTCPU usage1 billionOriginal method96583 ms100%1 billionOptimization method17,0003.3 ms100%

2. We recommend you to perform indexing by partition to optimize the database kernel. At the kernel layer, one BTree index corresponds to multiple trees, which solves the problem related to the multi-value point query and interval query in the array.

Image for post
Image for post

3. Currently, PostgreSQL GIN and GiST indexes only support the retrieval of the multi-value data, array, and JSON data in the inclusion and overlapping relations. It does not support query by value range. To support query by value range, you need to improve the inverted tree and develop the corresponding OPS.

Image for post
Image for post
Source:https://www.postgresql.org/docs/10/static/datatype-json.htm

The RUM index API can realize this function to some extent.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store