Typical Cases of Millisecond-Level Massive Spatio-Temporal Data Pivoting on PostgreSQL/GPDB


With the popularization of mobile devices, more and more business data includes a spatial component: FEED information from mobile phone users, Internet of Things and Internet of Vehicles data, meteorological sensor data, animal tracking data, and a variety of other tracking data.

Data is measured in the following dimensions:

  1. Space
  2. Time
  3. Business attributes such as temperature, humidity, cost, and oil consumption.

Data pivoting is crucial for company BI, analysts, and daily operations. It used to be limited to combinations of business and temporal dimensions, but now spatial pivoting is also required (for example, pivoting near a certain point over a certain period of time, pivoting for a certain province, or pivoting over northern latitudes).

Precomputation, real-time computation, and StreamCompute are all critical to real-time pivoting.

The data in the examples above does not include a spatial element. In this article we will examine database design and demos for data pivoting when multiple dimensions such as space, time, and business attributes are in play.

I. Scenario

Let’s look at an example of spatial and temporal data pivoting.

Sensors are installed on ride-hailing (Didi) vehicles, taxis, buses, coaches, and hazardous chemical transport vehicles. These sensors continuously generate and report data as the vehicles move.

At the same time, there are also static sensors uploading and reporting data.

There are 3 data structures. Different reporting modes correspond to different structures.

1. Single data report mode

2. Batch report mode: aggregated details (one record includes multiple values)

3. Batch report mode: package details in JSON format
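The three structures might look like the following sketch (all table and column names are hypothetical; PostGIS is assumed for the geometry type):

```sql
-- 1. Single report mode: one row per measurement
create table sensor_single (
  sid      int8,                    -- sensor ID
  pos      geometry(Point, 4326),  -- longitude/latitude position
  crt_time timestamp,              -- report time
  val      float8                  -- business attribute, e.g. temperature
);

-- 2. Batch report mode, aggregated details: one row carries multiple values
create table sensor_batch_arr (
  sid      int8,
  pos      geometry(Point, 4326),
  crt_time timestamp,
  vals     float8[]                -- multiple measurements in one record
);

-- 3. Batch report mode, details packaged as JSON
create table sensor_batch_json (
  sid      int8,
  pos      geometry(Point, 4326),
  crt_time timestamp,
  detail   jsonb                   -- e.g. {"temp": 21.5, "humidity": 0.6}
);
```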

II. Architecture Design

Two data pivoting architecture designs are available. The first is real-time bare computation: it is aggressive and requires more computing capability, but is appropriate when a model cannot be created.

The other architecture can be used when a model can be created: preprocessing reduces cost and improves response time.

See below:

Real-Time Architecture

In real-time mode, statistics are drawn directly from query results over the detail data. Query response speed depends on the cluster's computing capability, so heavy investment is generally required to keep response times low.

Pre-Treatment Architecture

Many pre-treatment methods are available: StreamCompute, T+N scheduling, and lambda scheduling.

1. Stream processing

For StreamCompute, you can choose PipelineDB (scheduled to be converted into a PostgreSQL plug-in in July). PipelineDB supports TTL, sliding windows, estimated-value statistics, aggregation statistics functions, and so on. Performance is good: a single machine can process about 1 million rows/s.

The data source is written into PipelineDB for StreamCompute. The StreamCompute results are written into the report library, RDS PG or HDB PG (for example, with a statistics dimension of one day and a TTL of 7 days, the previous day's statistics are transferred every day). If the detailed source data needs to be retained, you can write it into HDB PG or OSS.

This architecture design can achieve millisecond-level response times in situations where a model can be created. HDB PG's parallel computing capability can also be applied to situations where models cannot be created (real-time calculations must be performed over the detail data) for faster response times.

2. T+N scheduling

This is actually a common report system methodology. For example, detailed data is imported to the database or OSS in the early hours of the day, and a report is generated based on the model.

3. Lambda scheduling

This is similar to T+N scheduling, except that the StreamCompute node is replaced by HDB PG or RDS PG: statistics on the modeled data are computed incrementally, and the results are merged through FUNCTION calls and job scheduling.

III. Partition Rule Design

Partition rules are applied within data nodes. A rule should take into account how the data is queried, for example searches based on temporal or spatial ranges; for this case we provide two partition dimensions.

Both PostgreSQL and HDB support multi-level partitioning, so we can make a multi-level partition based on the two dimensions.

1. Temporal range. For example, with a partition based on temporal ranges, a new partition is created every day.

2. Geohash range. A geohash is a code representing a box defined by longitude and latitude coordinates. The length of the code determines its precision (box size), and adjacent boxes share code prefixes. Therefore, you can search a spatial range by a range of geohash codes.

For example, if you need to search data over a certain period of time, you can use the partition rule and pick up the corresponding partitions to narrow down the query range.

If you need to search data within a certain spatial range, you can make the partition based on the geohash range and query the corresponding partitions to narrow down the query range.
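A two-level partition combining both dimensions might be sketched as follows (declarative partitioning, PostgreSQL 10+; all names are hypothetical, and the geohash column is assumed to be maintained at insert time):

```sql
create table sensor_data (
  sid      int8,
  pos      geometry(Point, 4326),
  ghash    text,        -- st_geohash(pos, 8), filled in at insert time
  crt_time timestamp,
  val      float8
) partition by range (crt_time);

-- first level: one partition per day
create table sensor_data_20170627 partition of sensor_data
  for values from ('2017-06-27') to ('2017-06-28')
  partition by range (ghash);

-- second level: geohash ranges; adjacent boxes share prefixes, so a code
-- range covers a contiguous spatial area
create table sensor_data_20170627_w partition of sensor_data_20170627
  for values from ('w') to ('x');   -- all codes beginning with 'w'
```

A query constrained on both time and geohash range then only touches the matching partitions.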

HDB supports both geohash and geometry. You can create a GiST spatial index on geometry, and spatial indexes support KNN search (exact retrieval is not constrained by box boundaries).
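A GiST index and a KNN query might look like this sketch (PostGIS assumed; `sensor_detail` is a hypothetical detail table):

```sql
create index idx_sensor_pos on sensor_detail using gist (pos);

-- 10 nearest reports to a given point, ordered by distance via the index
select sid, crt_time, val
from sensor_detail
order by pos <-> st_setsrid(st_makepoint(120.19, 30.26), 4326)
limit 10;
```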

IV. Distribution Rule Design

The distribution rule determines how data is distributed across multiple data nodes. It should not be confused with the partition rule. We can either use a random distribution or distribute by a business-related field. Either way, we need to take data skew into consideration.

1. Random distribution: data is written to nodes randomly, which ensures uniformity, but every query must invoke all data nodes, and JOIN operations involve data redistribution.

2. Business ID: distribution by business ID. When querying by business ID, only the data node holding that ID is invoked. But you need to consider data skew: if a single business ID accounts for a large share of the data, distribution becomes non-uniform.

PS: consistent hashing addresses non-uniformity in data distribution.
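In Greenplum / HDB PG the two options are expressed in the table DDL, as in this sketch (hypothetical names):

```sql
-- random distribution: uniform, but every query touches all segments
create table sensor_detail_rand (
  sid int8, crt_time timestamp, val float8
) distributed randomly;

-- distribution by business ID: queries filtering on sid touch one segment,
-- but a very hot sid can cause data skew
create table sensor_detail_sid (
  sid int8, crt_time timestamp, val float8
) distributed by (sid);
```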

V. Precomputation Design

Precomputation is a method by which pivoting response time can be significantly reduced in situations where modeling can be used.

In addition to precomputation of business indicators, precomputation of temporal and spatial dimensions is also common.

Statistics share a common pattern: they are made by grouping, for example by time (hour, day, month, year, etc.) and by space (administrative region, box, geohash, grid, etc.).

Precomputation — Fixed Time

For example, business reports can be queried by day, month, and year. So we can precompute by day.

Multi-dimensional pivoting is done with the GROUPING SETS statement.
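A daily precomputation over several dimensions at once might be sketched like this (hypothetical names; columns absent from a grouping set come back as NULL):

```sql
select sid,
       date(crt_time)      as day,
       st_geohash(pos, 5)  as gh5,
       count(*), avg(val), min(val), max(val)
from sensor_data
where crt_time >= date 'yesterday' and crt_time < date 'today'
group by grouping sets (
  (sid, date(crt_time)),                      -- per sensor per day
  (date(crt_time), st_geohash(pos, 5)),       -- per geohash box per day
  (sid, date(crt_time), st_geohash(pos, 5))   -- both dimensions
);
```

The result set can be written into report tables and queried by day, then rolled up to month or year.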

Precomputation — Sliding Time Window

In the movie Interstellar, the protagonist creates a five-dimensional space which allows him to move freely across time.

Data pivoting can operate on a similar concept. You may need to query statistics within any time window in past records.

For example, you may have to query statistics 30 minutes before and after 27 June 2017 13:50:00.

There are two methods by which sliding window can be implemented:

1. Non-precomputed: involves generating statistics directly based on data details.

2. Precomputed: the statistics must be calculated at a granularity finer than the window. For example, if the window size is 1 hour, we can precompute down to the half hour; a sliding-window query then performs a second round of aggregation over the half-hour results.

Sliding window is also implemented this way in PipelineDB.
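The finer-granularity approach might be sketched as follows, assuming a hypothetical half-hour statistics table:

```sql
create table stat_30min (
  ts      timestamptz,   -- half-hour bucket start
  cnt     int8,
  sum_val float8
);

-- precompute: bucket detail rows into half-hour slots
insert into stat_30min
select to_timestamp(floor(extract(epoch from crt_time) / 1800) * 1800),
       count(*), sum(val)
from sensor_detail
group by 1;

-- second-round aggregation for a 1-hour window aligned to half hours,
-- e.g. 13:30 to 14:30
select sum(cnt) as cnt, sum(sum_val) as sum_val
from stat_30min
where ts >= '2017-06-27 13:30:00' and ts < '2017-06-27 14:30:00';
```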


Precomputation — Fixed Administrative Region

To perform precomputation on the spatial dimension, we need to group the spatial data according to the business requirements for spatial pivoting.


1. Business requirements call for statistics at the district-level administrative region, so we can precompute per district-level region.

First, functional support must be provided for translating longitude and latitude coordinates into an administrative region ID, i.e., a function that returns the ID of the region containing a given point.

A traditional method is to store the geometry of each administrative region in a table, and use a GiST index to quickly return the region whose geometry contains the given longitude and latitude.
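A sketch of this lookup, with a hypothetical table of district polygons and helper function:

```sql
create table district (
  did  int,                            -- district ID
  geom geometry(MultiPolygon, 4326)    -- district boundary
);
create index idx_district_geom on district using gist (geom);

-- return the ID of the district containing a point
create or replace function get_district(pt geometry)
returns int as $$
  select did from district
  where st_contains(geom, pt)
  limit 1;
$$ language sql stable;
```

Detail rows can be tagged with `get_district(pos)` at write time, after which district-level precomputation is an ordinary GROUP BY.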


2. Calculate by geohash prefix, for example truncating to a 5-character geohash and aggregating (compute the geohash of the longitude and latitude with the PostGIS ST_GeoHash() function).
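For example (hypothetical table name):

```sql
-- aggregate per 5-character geohash box
select st_geohash(pos, 5) as gh5, count(*), avg(val)
from sensor_data
group by 1;
```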

Geohash precision by code length is approximately as follows: 1 character, about ±2,500 km; 2, ±630 km; 3, ±78 km; 4, ±20 km; 5, ±2.4 km; 6, ±0.61 km; 7, ±0.076 km; 8, ±0.019 km.

Precomputation — Sliding Space Window (Point Radiation)

Implementation of spatial sliding windows is similar to that of temporal sliding windows. For example, we can precompute statistics by 8-character geohash code, then perform secondary aggregated pivoting by 5-character geohash prefix.
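The rollup might be sketched like this, with a hypothetical 8-character statistics table:

```sql
create table stat_gh8 (gh8 text, cnt int8, sum_val float8);

-- precompute at 8-character granularity
insert into stat_gh8
select st_geohash(pos, 8), count(*), sum(val)
from sensor_data
group by 1;

-- secondary aggregation: the 5-character prefix is the enclosing larger box
select substr(gh8, 1, 5) as gh5, sum(cnt), sum(sum_val)
from stat_gh8
group by 1;
```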

A certain degree of distortion is inevitable when calculating statistics for spatial and temporal sliding windows, because the window boundaries cannot align exactly with the precomputed buckets.

If this is a concern, you can consider real-time calculation.

Precomputation of business dimensions is similar to that of temporal and spatial dimensions. It will not be described here.

VI. Pivoting Design

Pivoting Requirement — Fixed Time
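Querying a precomputed daily report might look like this sketch (`rpt_daily` is a hypothetical precomputed table with per-sensor per-day counts and sums):

```sql
-- daily report for one sensor over a month
select day, cnt, sum_val / cnt as avg_val
from rpt_daily
where sid = 1
  and day >= '2017-06-01' and day < '2017-07-01';

-- monthly rollup derived from the daily precomputation
select date_trunc('month', day) as month, sum(cnt), sum(sum_val)
from rpt_daily
group by 1;
```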

Pivoting Requirement — Sliding Window

Secondary aggregation is required for sliding windows implemented through precomputation.
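For example, assuming half-hour precomputed statistics in a hypothetical table `stat_30min (ts, cnt, sum_val)`, a window of 30 minutes either side of 2017-06-27 13:50 (13:20 to 14:20) is approximated by the covering half-hour buckets:

```sql
select sum(cnt) as cnt, sum(sum_val) / sum(cnt) as avg_val
from stat_30min
where ts >= '2017-06-27 13:00:00'   -- bucket containing 13:20
  and ts <  '2017-06-27 14:30:00';  -- bucket containing 14:20
```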

Pivoting Requirement — Fixed Administrative Region
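If detail rows are tagged with a district ID at write time (e.g. via a lookup function over a district polygon table), the pivot is an ordinary GROUP BY, sketched here with hypothetical names:

```sql
-- per-district statistics for one day
select did, count(*), avg(val)
from sensor_detail
where crt_time >= '2017-06-27' and crt_time < '2017-06-28'
group by did;
```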

Pivoting Requirement — Sliding Spatial Window (Point Radiation)
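A real-time point-radiation pivot might be sketched as follows (PostGIS assumed; hypothetical table; the geography cast makes the distance argument meters):

```sql
-- all reports within 5 km of a given point
select count(*), avg(val)
from sensor_detail
where st_dwithin(
        pos::geography,
        st_setsrid(st_makepoint(120.19, 30.26), 4326)::geography,
        5000);
```

With geohash-range partitions, the same query can first be narrowed to the partitions whose geohash range covers the 5 km circle.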

VII. Combining with OSS

If you don't need to query detailed records, or only query a small number of them occasionally, then you don't need to load the detail data into the database; you can simply import it into OSS.

You can use RDS PG or HDB PG to concurrently access OSS and implement real-time pivoting or precomputation.
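In HDB PG this is typically done through an external table over OSS; the sketch below paraphrases the general shape from memory, so the exact LOCATION parameters should be checked against the Alibaba Cloud documentation:

```sql
create readable external table sensor_detail_oss (
  sid int8, lon float8, lat float8, crt_time timestamp, val float8
)
location('oss://oss-cn-hangzhou.aliyuncs.com
          dir=sensor/ bucket=mybucket accessid=... accesskey=...')
format 'csv';

-- real-time pivoting directly over the OSS-resident detail data
select sid, count(*), avg(val)
from sensor_detail_oss
group by sid;
```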

VIII. StreamCompute and Lambda

StreamCompute mode: create a stream based on the data structure, create a statistics view on the stream, activate the view, and write data into the database in real time; statistics are then computed continuously on the flowing data to produce results.

If the statistics results become too large, we can use a TTL statistics view and regularly transfer the results to HDB PG.


Define stream

Define TTL statistics view and maintain for one week

Activate statistics view

Regular data transfer

Data pivoting
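The steps above might be sketched in PipelineDB as follows (names are hypothetical; the TTL options follow the PipelineDB documentation and may differ across versions):

```sql
-- 1. Define the stream
create stream sensor_stream (
  sid int8, lon float8, lat float8, crt_time timestamp, val float8
);

-- 2. Define a TTL statistics view keeping one week of per-day statistics
create continuous view cv_sensor_daily
with (ttl = '7 days', ttl_column = 'day') as
select sid,
       date_trunc('day', crt_time) as day,
       count(*) as cnt,
       avg(val) as avg_val
from sensor_stream
group by 1, 2;

-- 3. Continuous views aggregate as soon as data is written to the stream
insert into sensor_stream values (1, 120.19, 30.26, now(), 21.5);

-- 4. Regular transfer: a scheduled job copies the previous day's rows from
--    cv_sensor_daily into the report library (HDB PG), where pivoting runs.
```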

IX. Summary

1. Why don’t we create a composite index of time and space fields?

If I need to query data for a certain point in the neighboring N kilometers at a certain time interval, how can I quickly locate the target detailed data?

An index on either the temporal or the spatial dimension converges quickly in that one dimension; to filter on the other dimension, tuples must be fetched from the heap, then rechecked and filtered.

Why not a composite index on the time and space fields? Because both query conditions are range (continuous) conditions, and a composite index must scan every index entry matching the driving column's range. A composite index is only highly efficient when the driving column uses an equality condition.

It is recommended to use partitioning for queries on the temporal and spatial dimensions: partitioning is the most effective method of narrowing down the data range.

It is recommended that geohash range partition be used for the partition of the spatial dimension.

2. Precomputation is the core of millisecond-level pivoting on any dimension.

3. For estimated statistics, use the HLL data type. It helps you quickly query the number of unique values and of newly added values.
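With the postgresql-hll extension, this might be sketched as follows (hypothetical tables):

```sql
create extension if not exists hll;

create table stat_uv (day date, users hll);

-- precompute one HLL per day from the detail data
insert into stat_uv
select date(crt_time), hll_add_agg(hll_hash_bigint(sid))
from sensor_data
group by 1;

-- approximate distinct sensors per day (# is the cardinality operator)
select day, # users from stat_uv;

-- approximate distinct sensors over a month, via HLL union
select # hll_union_agg(users)
from stat_uv
where day >= '2017-06-01' and day < '2017-07-01';
```

Newly added values can be estimated by subtracting one day's cardinality from the cardinality of its union with the preceding days.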

4. If you need to aggregate detail values, you can store them using array or JSONB fields.
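For example (hypothetical names; `jsonb_agg` requires PostgreSQL 9.5+):

```sql
select sid,
       date(crt_time) as day,
       array_agg(val) as vals,   -- detail values kept as an array
       jsonb_agg(jsonb_build_object('t', crt_time, 'v', val)) as detail
from sensor_data
group by 1, 2;
```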

5. Alibaba Cloud RDS PG, HDB for PG, OSS, StreamCompute plug-in, cloud ETL scheduling task serviceprovide a comprehensive big data real-time calculation solution for millisecond-level data pivoting of any temporal or spatial dimension.

6. For spatial data, in addition to the GiST index, there is also a more cost-effective option, the BRIN index, which is excellent for filtering data physically ordered by ST_GeoHash.
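A sketch of this approach (hypothetical names; BRIN requires PostgreSQL 9.5+): order the detail rows by geohash so spatially adjacent rows are physically adjacent, then a tiny BRIN index can skip whole block ranges.

```sql
create table sensor_by_gh as
select *, st_geohash(pos, 8) as gh8
from sensor_data
order by st_geohash(pos, 8);

create index idx_gh8_brin on sensor_by_gh using brin (gh8);

-- a geohash prefix range query can now use the BRIN index
select count(*)
from sensor_by_gh
where gh8 >= 'wtmk' and gh8 < 'wtml';
```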



