Efficient Search on Massive Data with Multidimensional Attributes in Time, Spatial, and Object

Background

Activities by people and other objects lead to massive amounts of spatio-temporal data. If technologies could bring us back to the past, we may ask what the status of the past world would be?

In fact, this need also exists in databases.

Object Data Classification

Object data can be classified into two types: static data (relatively static, for example, a building) and dynamic data (such as activities of people and activities of IoT sensors).

Data Classification by Research Requirements

1. Spatio-temporal snapshot search

Some objects generate data at a relatively low frequency. For example, inert objects such as buildings and roads may not have any change in years. If data generated for these objects is written into databases and we query data by time range (for example, query data from 2017–07–01 to 2017–07–02), we may not find any data related to these objects. The reason is simply because no relevant data was written into databases at all within that time period.

2. Spatio-temporal behavior data search

Spatio-temporal behavior data refers to the feed data from dynamic objects like people’s activities.

For example, analyze the features of a population group in a specific region within a specific time period. Or analyze differences in the makeup of the crowds around universities during the week and on weekends.

Spatio-temporal snapshots are not within the scope of this article. For information about spatio-temporal snapshots, refer to the previously written article on this subject. Now let’s look at how to search for spatio-temporal behavior data.

Data Structure

Spatio-temporal behavior data contains three properties: time, space, and objects.

Unstructured data:

In addition to JSON, structured data can also be used for object description. Example:

SQL query example of spatio-temporal behavior data

Optimization Methods

Consider the following knowledge:

1. Time Series BRIN Indexes

The crt_time field is a time-series field that represents the time when data is generated. Storage and the value of this field have a strong linear correlation in PostgreSQL heap storage.

Therefore, BRIN indexes are very suitable.

I used a BRIN index for a TPC-H test in place of a partitioned table. The performance of searching in a large range is even better than that of using a partitioned table.

2. Spatial Indexes

Obviously, spatial retrieval requires spatial indexes. Three methods can be used in PostgreSQL to implement spatial retrieval.

1. GIST index for the geometry type

This index supports features such as spatial KNN search and spatial position determination.

2. SP-GiST index for the geometry type

This index supports features such as spatial KNN search and spatial position determination.

3. GEOHASH and B-tree indexes (convert longitude and latitude into GEOHASH and create a B-tree index for HASH VALUE) Simply use an expression index.

This index supports prefix search (which implements the relationship contained in the encoded geographic information grid). It belongs to the LOSSY index and requires secondary filtering.

The GiST and SPGiST spatial indexes can find accurate geographical location information and are better than the GEOHASH index. However, you need to pay special attention when querying information.

3. GIN Index

This index type targets the object property field JSONB or multiple structured object property fields. Simply use a GIN index.

Examples:

Unstructured index:

Structured index:

4. BitmapAnd BitmapOr

In the preceding section, different indexes are selected for different query dimensions based on data types and query requirements.

However, can these indexes be used at the same time? PostgreSQL provides the bitmapAnd and bitmapOr interfaces for multiple indexes, which can combine multiple indexes and reduce the number of databases that need to be scanned.

Example:

Proper indexes are automatically used based on statistics. If necessary, bitmapAnd or bitmapOr merge scans will be automatically performed on multiple indexes. Skip the pages that don’t need to be scanned and recheck the hit pages.

5. Heap Table Storage Grading and Partitioning

Storage can be separated into one level partition or multi-level partitions:

1. Single-level partitioning:

For example, partition by time.

2. Multi-Level partitioning

For example, partition first by time and then by GEOHASH.

When partitions are used, corresponding partitions are automatically located if query conditions include partition keys (such as time and space range), reducing the amount of data that needs to be scanned.

Create GIN indexes targeting object properties to implement extremely efficient query.

6. Index Grading and Partitioning

Like data, indexing also supports partitioning logic when partitioned tables are not used.

Example

Spatial index + time partition

By using the preceding partitioned index, target data can be quickly located after a time range is entered to perform spatial search.

More index partitions can be used, such as a dimension (object property) used as the search criterion and shop type (assume it is enumerable or in a relatively small range).

By using the preceding partitioned index, target data can be quickly located after a time range or specific conditions are entered to perform spatial search.

Note that the preceding SQL queries can implement extremely significant performance optimization.

Index organization forms (or index structures) can be restructured by logic partitions to cover all conditions in a manner similar to the preceding index creation method.

7. CTID Intersect Array JOIN SCAN

As mentioned earlier, BitmapAnd and BitmapOr merge scan is automatically performed in multiple indexes or GIN indexes. In fact, this scan can also be explicitly performed in SQL.

Each condition filters a corresponding CTID.

Use Intersect or Union to generate CTIDs that meet the overall requirements. (Intersect corresponds to the “and” condition; union corresponds to the “or” condition.)

Generate an array of ctids.

Example

1. Create an Object Feed Data Table

2. Write 50 Million Pieces of Test Data into the Table

3. Create an Object Index

4. Create a Time Index

5. Create a Spatial Index

6. Generate data layout to facilitate subsequent queries

7. Create an Extreme KNN Query Function

8. CTID Merge Retrieval

Display records that meet the following criteria

First look at each criterion separately, find how many records match a criterion and how much time overhead is incurred on an index scan.

1) 54,907 records

2) 95,147 records

3) 149,930 records (PostgreSQL uses bitmapOr for merge scanning to quickly get results)

4) 60,687 records (It still takes up to 195 milliseconds even if we use the excellent KNN performance optimization.)

Let’s look at how long it takes if we don’t use KNN optimization.

The result is very surprising — extreme optimization performance is improved by one order of magnitude.

5) 2640751 records

Use all indexes to scan data conditions one by one to get ctids and perform ctid scans. Now, let’s break down the procedure:

First let’s look at merged querying of time and object properties. The performance is excellent. When using bitmapAnd or bitmapOr, a query can skip most data blocks and scan time is shorter than that of a single-index scan.

Note that the number of records is reduced to 7847 in this step.

Then look at the scan time for KNN:

Note that 60,687 records meet the KNN distance condition, so I will explain the performance comparison between CTID merge scans and original scans.

Finally, we combine these pieces into ctid.

Obtain the final records.

It takes 462 milliseconds.

9. Test the Performance of the Original SQL Queries — PostgreSQL Multi-Index BitmapAnd BitmapOr Skip Scan

Write an SQL query directly instead of using multi CTID scans.

The performance is excellent as expected. We explained the reason earlier. Conditions other than the KNN condition have already converge results to 7,000 records, so it is not necessary use indexes containing KNN conditions (It takes 195 milliseconds even a KNN index is used because 60,687 records meet the KNN condition.)

Result verification:

Partitioned Index Example

Assume that the preceding query criteria remain unchanged. We use partitioned indexes to test the performance.

(This is to demonstrate the extreme effects of partitioned indexes. In practical scenarios, convergence level may not be that high, for example, converge by day or by ID HASH. As long as convergence is possible, we can implement excellent performance.)

Reconstruct the extreme KNN optimization function

Query performance:

Great! Query time is reduced from over 200 milliseconds to less than 1 millisecond.

Summary of Optimization Methods

A retrospective look at the optimization methods:

1. Build different indexes for different data types.

For example, use GiST or SP-GiST indexes for space, B-tree or BRIN indexes for time, and GIN indexes for multiple object properties.

The purpose of indexing is to reduce the scope of data scans.

2. Method 5 mentions data partitioning. The purpose of data partitioning is to organize data in an intentional manner, which means that data is intentionally organized to meet search requirements. For example, if time is a required query condition or a common query condition, then data can be split by time (partitions) to reduce the amount of data that needs to be scanned.

3. Method 6 describes index partitions. The purpose is similar to method 5. The difference between method 5 and method 6 is that partitions are used at the index level so that the data hit rate is improved directly when an index scan is performed.

4. The ctid merge scan in method 7 is similar to multi-index bitmapAnd or bitmapOr scans in PostgreSQL. bitmapAnd/bitmapOr skips blocks that don’t require scans, which ctid merge scan in method 7 skips row that don’t require scans.

Merge ctids obtained from multiple index scans. Skip the numbers of rows that don’t require scans.

If a filtering condition can significantly reduce ctid (records) when other conditions are AND, it is unnecessary to use a ctid merge scan. Instead, use FILTER as one other condition. (This will slightly increase the CPU overhead.)

5. The best Kung Fu always features ultimate flexibility, freedom, and unlimited imagination of each movement.

PostgreSQL implements multi-index bitmapAnd or bitmapOr scans, significantly improving the data hit rate for multiple conditions (indexes).

In addition, PostgreSQL features an excellent CBO estimation mechanism, which allows PostgreSQL to not always use all indexes for bitmap merge scans. This is also why the performance described in the section “Test the performance of the original SQL queries — PostgreSQL multi-index bitmapAnd bitmapOr skip scan” is better.

6. How to implement extreme optimization

Adopt method 5 or method 6 and use fixable conditions as partition keys to partition data or indexes.

For other conditions, multi-index bitmapAnd or bitmapOr scans in PostgreSQL can be used to improve the data hit rate for multiple conditions (indexes).

As we can see, time needed for multidimensional retrieval from 50 million pieces of data by time, space, and object properties is reduced to 0.592 milliseconds.

7. For spatial data, in addition to using the GiST index, we can also use a BRIN index, which requires a lower cost. The performance of filtering is excellent after data is well organized.

Original Source

https://www.alibabacloud.com/blog/efficient-search-on-massive-data-with-multidimensional-attributes-in-time-spatial-and-object_594987?spm=a2c41.13103962.0.0

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store