Real-time Statistical Analysis on Sliding Windows with PostgreSQL

Background

In real life, there is often a need for aggregation analysis. For example, the number of people stopping at each store in a mall at each time point. Some techniques can perceive the location of people. When you enter a certain area, a record is written, indicating that you have entered the area. When you leave, a record of you leaving is recorded. If you do not move for a long time, a heartbeat record is written regularly.

The number of people online at each point in time for an online game is also a common analysis. In this scenario, we can write an online record when users go online and an offline record when they go offline. Another example is analyzing the number of online and offline bikes at a point in time of a bike-sharing company. In this scenario, we can write an online record when users borrow bikes and an offline record when users return the bikes. At the same time, the state of the bike is queried at regular intervals.

This concept can also be extended to enterprise-level applications. For instance, we can analyze the minimum and maximum number of online sensors per minute for an Internet of Things (IoT) enterprise. We can write an online record when the sensor goes online and an offline record when it goes offline. At the same time, the state of the sensor is queried at regular intervals. These are very typical Front End Engineering and Design (FEED) applications that require generating the number of online objects of this system at each point in time. If generating by time period, the maximum and minimum number of online objects that are generated in each time period is actually the boundary of the range.

In this article, we’ll discuss how to use PostgreSQL to realize real-time statistical analysis of any sliding window with daily data increment of about 10 billion

Database Design

An IoT enterprise uses sensors to analyze its system status. When a sensor goes online, an online record is written, and when it goes offline, an offline record is written. At the same time, the state of the sensor is queried at regular intervals. This means that sensors that do not have a record within an hour are considered offline.

The enterprise needs to count the minimum and maximum number of online sensors per minute.

1. Table Structure

2. Index

Write 110.1 million test data records (assuming this is the data write volume for 1 hour, then 2642.4 million records are written throughout the day) and 1,001 sensor IDs.

3. TTL Data

The TTL for data, which ensures that the table is small and contains only data within the heartbeat time range.

The heartbeat is received every hour, so there must be data within 1 hour, and sensors without data are not counted. Therefore, the state only needs to be reserved within 1 hour.

One way to reserve is to use pipelinedb. Another way to reserve is to use two tables for polling.

4. Recursive Query

Use a recursive query to query the final state of the sensor efficiently.

The implementation plan is as follows:

Sample

It is very efficient. For 110.1 million pieces of data, the final online state can be obtained in 11 milliseconds.

An example of an online device with state=t is as follows:

5. Number of Sensors

Count the number of sensors online at any point in time. If the time for each device to go online is exact to seconds (crt_time is exact to seconds), then the number of sensors online at a maximum of 86,400 points in time per day is required regardless of the number of records.

For example, to count the number of sensors online at 2017-07-05 10:29:09, just add a time limit.

Adding this time limit has some performance impact, especially if the time is a long time ago: the more filtering, the more serious performance degradation.

Therefore, we recommend starting a query request every second in real time without adding this time limit.

6. Generate the Number of Online Objects of Each Second in the past at One Time

The frame query technique using window query. (A frame indicates the interval to the current record after records are ordered by time.)

7. Count the Maximum and Minimum Number of Online Objects per Minute

Query once per second and write data to the results table.

It only takes 12 milliseconds per query, so there are no problems calling it once per second. Count the maximum and minimum number of online objects in a minute.

Optimizing with Large Number of Sensor IDs

When the number of sensor IDs reaches 100,000, the query performance drops to 250 milliseconds.

If the number of sensor IDs is particularly large; for example, more than one million, the performance drops to 2.5 seconds. In this case, it is not appropriate to query once per second.

Therefore, how can we optimize when there are a lot of sensors?

A better method is to hash the data by sensor ID. For example, each partitioned table has 10,000 sensor IDs. When querying the number of online objects, all partitioned tables are queried concurrently, and this reduces RT.

Optimization Method 2: Subquery

If the sensor ID is maintained in another table, a subquery can be used to optimize for this example.

Write 100,001 sensor IDs and the reported data reported by 100 million sensors.

Use a subquery to query the last VALUE of each sensor ID.

The subquery loops several times to get a number of records equal to a. And, fetch the last VALUE.

Time-consuming example 1

Time-consuming example 2

Summary

Through the method (recursive query) mentioned in this article, we can achieve very fine-grained, real-time statistics of the state of a large number of tracked objects (a single machine supports real-time pivot of any sliding window with daily increment of about 10 billion).

This method is useful to draw the real-time state chart of tracked objects, such as a real-time heatmap and determining the number of sensors (users) online and offline in real time, and the maximum and minimum online and offline values of any sliding window.

Reference:https://www.alibabacloud.com/blog/real-time-statistical-analysis-on-sliding-windows-with-postgresql_594812?spm=a2c41.12889650.0.0

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.