Displaying Massive Amounts of Data in Real Time


Image for post
Image for post

With the development of Internet of Things, more data is generated at an increasingly fast speed. This IoT data may include vehicle traffic volume on highways, users’ residence data, water temperature sensor data, e-commerce feed data, and tracking data of online rideshare vehicles.

This massive amounts of data is also used for different purposes. Some data requires real-time streaming statistics, some requires time-series processing, and some requires full analysis.

Take large screens at a command center for example. Some requirements are very interesting, such as the following:

  1. Present the latest sensor top value data.
  2. Select a timeline and display people flow by region


1. Find the Latest Values Reported by All Sensors

Design the table structure — GID is a sensor ID, val is an uploaded value, and crt_time is time.

Consider 10,000 sensors and insert 10 million records.

Write 10 million pieces of sensor test data

Create an index

Use a recursive call to obtain the latest values of all sensors (use the maximum auto-increment ID of each sensor as the latest report tag)

It takes only 129 milliseconds to search 10 million records for the latest 10,000 records .

Why is it so fast? The reason is that the use of recursive queries reduces the amount of data to be scanned and computed.

Using a cursor to obtain 10 records at a time enables faster performance — it only takes 0.36 milliseconds to complete the same task.

2. Find the 10 Busiest Intersections by Vehicle in a City

Image for post
Image for post

Compared with the first example, data in this example is converged and sorted by value, and the largest value is required.

Assume that a sensor at each intersection continuously uploads vehicle traffic data at that intersection. The large screen displays the 10 intersections with the largest vehicle traffic.

I will still use the data in the preceding example for convenience.

It takes only 126 milliseconds to search 10 million records for the latest 10,000 values and return the 10 largest values.

3. Find the Busiest Shop by People Flow in a Specific Period in an Area

Image for post
Image for post

Compared with the two preceding examples, this example has two more dimensions:

One is the time dimension, which allows users to select a specific time period for analysis. The other is the region dimension, which allows users to specify a specific region to obtain data related to that region.


Unlike B-tree indexes, spatial indexes are not stored in a sequential manner. Spatial indexes are GiST indexes and use a structure similar to cluster partitioning. Therefore, it is impractical to combine spatial GiST indexes and index sorting to obtain the first record. This requires SORT on displayed records.

At the same time because the query criteria contains a time interval, it is also impractical to sort the non-driving columns (gid and VAL fields) of the index.

When can we combine querying and sorting of composite indexes?

Implicit sorting can only be used when all the columns before the column to be sorted are equivalent value queries and the order of the index is consistent with that of sorting. For example, index(a,b,c) supports where a=? and b=? order by c, but doesn’t support where a> ? and b=? order by c.

For test convenience, re-plan test data and use points in place of longitudes and latitudes. In a realistic business scenario, you can use the geometry type.

Insert 10 million pieces of test data, 10,000 shop IDs, and 100 million random points in a point matrix.

How can we speed up filtering and finding the largest value uploaded by spatio-temporal sensors?

Two optimization methods are available in different scenarios.

1. The total number of sensors (shops) is not too large (for example, 10,000 shops)

Use an index to search for the largest VAL of each GID and use partial indexes to avoid the long time problem; use CPU to complete the determination of points and surfaces.


If the minimum time range that we allow users to select is two hours, we can create a partial index every two hours. (Using so many partial indexes is neither proper nor reasonable. It is advised to create a partition every two hours after the optimization of partitioned tables in PostgreSQL 10.)

This interval includes a total of around 3,500,000 pieces of data.

Use this partial index and recursive calls to obtain the largest values among all shops in this interval. Then obtain data related to a specific region according to the point and surface determination and sort data to find the 10 largest values.

Results are returned in 135 milliseconds

The implementation code is as follows

2. The number of shops is too large, but after the spatio-temporal convergence, the number of records is relatively small (for example, millions of records)

In this case, consider using partition tables by time. Then construct spatial indexes.

Use a time condition to locate the specified partition and use spatial indexes to filter data. Use few CPU resources and compute the filtered data to obtain the top 10 shops.


2.1. Partition a table by time (for example, a partition every two hours; the reason has been explained above, and will not be discussed here.)

2.2. Create a spatial index

2.3. Create data perspectives

When a time condition is entered in SQL, PostgreSQL automatically locates the target table partition. Here we directly write into the TEST table for convenience.

Use the window query to obtain the top SQL


Implementation code

Kernel-Level Optimization (Support for Spatial Grid Table Partitions)

Allow PostgreSQL to support spatial GRID partitions (In fact, you can now use inheritance to implement this, and use grid+mod to determine which partition data should be inserted.)

Image for post
Image for post

For data perspective by the spatial and temporal dimensions, we can create spatial Grid partitions and sub-partitions of a time partition.

During the process of data retrieval, locate the target sub-partitions directly by using table partitions. Use B-tree indexes and recursive calls to filter peak data of each shop in intended intervals and use few CPU resources to calculate and find the top shops.

Querying can be implemented in no more than 100 milliseconds by using this method to create spatio-temporal four-dimensional data perspective.

Business Optimization Methods

1. In example 1 and example 2, since the application layer uses recent data, historical data is not involved. In addition to recursive optimization, another two optimization methods are available.

Method 1: Do not record historical data and replace insert with insert or update. By using this method, the results of the sort_test are always the latest values.

Method 2: Record both historical data and the latest status. Use a trigger to complete this job.


When data is inserted, the last record is automatically updated. (The volume of written data equals that of updated data)


Create a status table to record the latest status and create a trigger to automatically update the status table when historical data is written.

When data is inserted, the status is automatically updated.

When querying data, directly query the latest status table. Recursive calls are no longer required.

2. In example 3, historical data is analyzed by two dimensions: time and space.

Therefore, we can use one of the two dimensions as a partition and scramble data to create an index by the other dimension in that partition.

This allows data to be converged as much as possible.

Partitions are supported for both space and time. (It is recommended to use grid expressions for space partitions in order to easily find and locate partitions.)

Original Source


Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store