Streaming Statistics in PostgreSQL with INSERT ON CONFLICT

By Digoal

In streaming statistics scenarios, data is appended constantly, in real time, which allows existing data sets to be reanalyzed continuously and gives you further insight into your entire body of data as it changes over time. At the core of streaming statistics and this type of data analytics are aggregate functions such as count, avg, min, max, and sum. Although simple, these functions are powerful: they can quickly provide interesting insights about your data and can be used to render statistical charts on real-time dashboards.

Another primary part of streaming statistics is the convention of FEED logs, which contain FEED files and output reports. These logs keep track of the flow of data and of how that data behaves in a particular system, giving you a stronger grasp of your data. FEED logs are an important part of the streaming statistics conventions used at Alibaba, where they support systems such as the logistics giant Cainiao, the mass e-commerce platform Taobao, and the more recent Alibaba Games, as well as other business systems. The statistical results of these FEED logs are output based on various parameters and dimensions, such as real-time FEED statistics and real-time online users in all dimensions.

Interested in using streaming statistics to see how data is being used in your systems? You can use PostgreSQL's INSERT ON CONFLICT syntax together with RULEs and TRIGGERs to implement real-time data statistics. The performance indicators on a 56-core Alibaba Cloud Elastic Compute Service (ECS) instance are as follows:

On a single instance with a single table, single-row stream processing can reach up to 0.39 million rows per second, and batch-write stream processing can reach up to 3.36 million rows per second.

The diagram above shows how you can combine the statistical data from FEED log outputs with the aggregate functions and other capabilities of PostgreSQL in streaming statistics scenarios. With this combination, you can build a real-time statistics system relatively easily.

In this article, we will look at how you can build a streaming statistics solution with these capabilities on the kind of ECS instance discussed above.

Collecting Statistics with PostgreSQL

In this example, you will collect the min, max, sum, and record count of the values of each SID in real time. To do this, follow the steps below (a combined SQL sketch of these steps is shown after the list):

1. Create a test table that contains statistical fields that are automatically generated by PostgreSQL.

2. Create a detail table to check whether the PostgreSQL stream computing results are correct.

3. Implement the stream computing logic with an INSERT ON CONFLICT SQL statement.

4. Write massive data volumes for testing.

5. Verify that the algorithm is correct.
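
As a minimal sketch of these steps, assuming illustrative names (a statistics table test_stat, a detail table test_detail, and a value column val keyed by sid):

-- Step 1: statistics table whose fields PostgreSQL maintains in real time
create table test_stat (
  sid     int primary key,
  cnt     int8,
  sum_val numeric,
  min_val numeric,
  max_val numeric
);

-- Step 2: detail table used only to verify the stream computing results
create table test_detail (
  sid      int,
  val      numeric,
  crt_time timestamp default now()
);

-- Steps 3 and 4: for every incoming value, keep the detail row and merge it
-- into the per-SID aggregates with INSERT ON CONFLICT (one row shown)
insert into test_detail (sid, val) values (1, 100);

insert into test_stat as t (sid, cnt, sum_val, min_val, max_val)
values (1, 1, 100, 100, 100)
on conflict (sid) do update set
  cnt     = t.cnt + 1,
  sum_val = t.sum_val + excluded.sum_val,
  min_val = least(t.min_val, excluded.min_val),
  max_val = greatest(t.max_val, excluded.max_val);

-- Step 5: verify by comparing with aggregates computed from the detail table
select sid, count(*), sum(val), min(val), max(val)
from test_detail
group by sid;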

After real-time statistics are enabled, the query response time drops from 1817 ms to 0.5 ms, an improvement by a factor of roughly 3,600.
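
The improvement comes from replacing a full aggregation over the detail data with a single-row primary-key lookup on the statistics table, roughly as in this sketch (reusing the assumed table names above):

-- before: aggregate over all detail rows for one SID (slow on a large table)
select count(*), sum(val), min(val), max(val)
from test_detail
where sid = 1;

-- after: read the pre-aggregated row maintained by INSERT ON CONFLICT (fast)
select cnt, sum_val, min_val, max_val
from test_stat
where sid = 1;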

With this said, let’s look at what happens as we stress test the system:

Single-Table Stress Testing

Single-Instance Streaming Statistics Performance

This test case involves a single dimension and multiple tables. Based on the remaining CPU headroom, the estimated performance is 0.385 million rows/s.

Single-Table and Batch Writing Performance Stress Testing

This test case involves a single dimension, a single table, and batch writing. The performance is 1.1765 million rows/s. Based on the remaining CPU headroom (for example, by running multiple database instances on a single ECS server or by using UNLOGGED tables), the estimated performance is 3.36 million rows/s.
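
The test scripts themselves are not shown here, so the following pgbench-style sketch is only an assumption of what they might look like, using the test_stat table assumed earlier; single_row.sql and batch_write.sql are hypothetical file names.

-- single_row.sql (pgbench script): merge one streamed value per transaction
\set sid random(1, 10000)
\set val random(1, 100000)
insert into test_stat as t (sid, cnt, sum_val, min_val, max_val)
values (:sid, 1, :val, :val, :val)
on conflict (sid) do update set
  cnt     = t.cnt + 1,
  sum_val = t.sum_val + excluded.sum_val,
  min_val = least(t.min_val, excluded.min_val),
  max_val = greatest(t.max_val, excluded.max_val);

-- batch_write.sql (pgbench script): simulate a batch of 1000 incoming values
-- and merge their aggregates in a single statement
\set sid random(1, 10000)
insert into test_stat as t (sid, cnt, sum_val, min_val, max_val)
select :sid, count(*), sum(i), min(i), max(i)
from generate_series(1, 1000) i
on conflict (sid) do update set
  cnt     = t.cnt + excluded.cnt,
  sum_val = t.sum_val + excluded.sum_val,
  min_val = least(t.min_val, excluded.min_val),
  max_val = greatest(t.max_val, excluded.max_val);

-- example invocation:
-- pgbench -M prepared -n -r -P 1 -f single_row.sql -c 56 -j 56 -T 120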

Streaming Statistics Solution

Building on the stress testing above, you can now collect statistics based on more than one dimension, for example, based on multiple fields in the detail table.

Consider, for example, a detail table that contains several dimension fields, where aggregates are needed per field and per combination of fields. In this case, how can you implement streaming statistics?

Besides PipelineDB, you can also use PostgreSQL's INSERT ON CONFLICT together with a RULE (or TRIGGER) to implement the same functionality.

Process Design

  1. Define a detail table.
  2. Define a target statistical table for each dimension.
  3. Define the INSERT ON CONFLICT SQL statement for each dimension table.
  4. Define the TRIGGER or RULE for the detail table and call INSERT ON CONFLICT sequentially to write data to multiple dimension tables.

Example

1. Define a detail table.
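
The sketches below use hypothetical names throughout (a detail table feed_log with dimension fields c1 and c2 and a value column val, plus statistics tables stat_c1, stat_c2, and stat_c1_c2). Step 1, the detail table:

create table feed_log (
  c1       int,        -- dimension 1
  c2       int,        -- dimension 2
  val      numeric,    -- measured value
  crt_time timestamp default now()
);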

2. Define a target statistical table for each dimension.
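
Step 2 of the sketch, one target table per statistical dimension, keyed by the dimension value:

create table stat_c1 (
  c1  int primary key,
  cnt int8, sum_val numeric, min_val numeric, max_val numeric
);

create table stat_c2 (
  c2  int primary key,
  cnt int8, sum_val numeric, min_val numeric, max_val numeric
);

create table stat_c1_c2 (
  c1  int, c2 int,
  cnt int8, sum_val numeric, min_val numeric, max_val numeric,
  primary key (c1, c2)
);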

3. Define the INSERT ON CONFLICT SQL statement for each dimension table.
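
Step 3 of the sketch, the merge statement for a dimension table; the stat_c1 version is shown for one incoming row (c1 = 1, val = 100), and the other tables follow the same pattern:

insert into stat_c1 as t (c1, cnt, sum_val, min_val, max_val)
values (1, 1, 100, 100, 100)
on conflict (c1) do update set
  cnt     = t.cnt + 1,
  sum_val = t.sum_val + excluded.sum_val,
  min_val = least(t.min_val, excluded.min_val),
  max_val = greatest(t.max_val, excluded.max_val);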

4. Define the TRIGGER or RULE for the detail table and call INSERT ON CONFLICT sequentially to write data to multiple dimension tables.
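
Step 4 of the sketch, a DO INSTEAD rule on the detail table that merges every incoming row into the three dimension tables (a trigger function could be written the same way):

create rule r1 as on insert to feed_log do instead (
  insert into stat_c1 as t (c1, cnt, sum_val, min_val, max_val)
  values (new.c1, 1, new.val, new.val, new.val)
  on conflict (c1) do update set
    cnt     = t.cnt + 1,
    sum_val = t.sum_val + excluded.sum_val,
    min_val = least(t.min_val, excluded.min_val),
    max_val = greatest(t.max_val, excluded.max_val);

  insert into stat_c2 as t (c2, cnt, sum_val, min_val, max_val)
  values (new.c2, 1, new.val, new.val, new.val)
  on conflict (c2) do update set
    cnt     = t.cnt + 1,
    sum_val = t.sum_val + excluded.sum_val,
    min_val = least(t.min_val, excluded.min_val),
    max_val = greatest(t.max_val, excluded.max_val);

  insert into stat_c1_c2 as t (c1, c2, cnt, sum_val, min_val, max_val)
  values (new.c1, new.c2, 1, new.val, new.val, new.val)
  on conflict (c1, c2) do update set
    cnt     = t.cnt + 1,
    sum_val = t.sum_val + excluded.sum_val,
    min_val = least(t.min_val, excluded.min_val),
    max_val = greatest(t.max_val, excluded.max_val)
);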

5. Conduct a test.
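
Step 5 of the sketch, write a few test rows to the detail table; the rule redirects each row to the dimension tables:

insert into feed_log (c1, c2, val) values (1, 10, 100);
insert into feed_log (c1, c2, val) values (1, 20, 50);
insert into feed_log (c1, c2, val) values (2, 10, 75);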

6. Check the test results.
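
Step 6 of the sketch, read back the aggregated rows:

select * from stat_c1;
select * from stat_c2;
select * from stat_c1_c2;

-- the detail table stays empty because the rule is DO INSTEAD
select count(*) from feed_log;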

7. Verify the test results.

Because the rule is defined as DO INSTEAD, data is not written to the local detail table. If you change the rule to DO ALSO, data will also be written to the local table. This result is as expected.

Parallel Design within an Instance

Define a detail partition table.

Example

  1. Define a detail partition table.
  2. Define a target statistical table for each dimension.
  3. Define the INSERT ON CONFLICT SQL statement for each dimension table.
  4. Define the TRIGGER or RULE for the partition table and call INSERT ON CONFLICT sequentially to write data to multiple dimension tables.
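
As a minimal sketch of this design, assuming several plain detail tables act as the partitions (the names and the number of partitions are illustrative), each carrying the same rule so that writers can spread across CPU cores:

-- hypothetical partitions of the detail table
create table feed_log_0 (like feed_log including defaults);
create table feed_log_1 (like feed_log including defaults);

-- each partition gets the same DO INSTEAD rule; the stat_c1 merge is shown,
-- and the other dimension tables follow the same pattern as before
create rule r1 as on insert to feed_log_0 do instead
  insert into stat_c1 as t (c1, cnt, sum_val, min_val, max_val)
  values (new.c1, 1, new.val, new.val, new.val)
  on conflict (c1) do update set
    cnt     = t.cnt + 1,
    sum_val = t.sum_val + excluded.sum_val,
    min_val = least(t.min_val, excluded.min_val),
    max_val = greatest(t.max_val, excluded.max_val);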

Parallel Design Outside an Instance

Implement hash-distributed writes at the upper layer.

Example

Upper-layer applications or middleware distribute writes across multiple PostgreSQL instances.

Use INSERT ON CONFLICT Together with Hyperloglog to Implement Real-Time UV Estimation

You can use INSERT ON CONFLICT together with Hyperloglog (HLL) to collect real-time UV statistics.
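
A minimal sketch of this approach, assuming the hll extension is installed and using a hypothetical statistics table uv_stat keyed by a dimension column dim:

create extension if not exists hll;

create table uv_stat (
  dim   int primary key,         -- statistical dimension (for example, a page or channel ID)
  users hll default hll_empty()  -- HLL sketch of distinct user IDs
);

-- merge one event (user 10001 on dimension 1) into the sketch
insert into uv_stat (dim, users)
values (1, hll_add(hll_empty(), hll_hash_integer(10001)))
on conflict (dim) do update
set users = hll_add(uv_stat.users, hll_hash_integer(10001));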

Stress testing result: 180,000 TPS

Query the estimated UV value as follows based on the HLL type. The query result is reliable.
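
For example, continuing with the hypothetical uv_stat table above:

-- estimated number of distinct users (UV) per dimension
select dim, hll_cardinality(users) as uv
from uv_stat
where dim = 1;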

For more information about the HLL plug-in, see the following references:

https://github.com/aggregateknowledge/postgresql-hll

https://github.com/citusdata/postgresql-hll (compatible with the PostgreSQL 10 header file)

Use INSERT ON CONFLICT Together with UDFs to Simplify the SQL Complexity of Stream Computing

You can use INSERT ON CONFLICT together with user-defined functions (UDFs) to greatly reduce the complexity of the SQL.

Example:
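
As a minimal sketch, wrapping the merge for the single-dimension test_stat table assumed earlier in a SQL function (the function name and signature are illustrative):

create or replace function write_stat(i_sid int, i_val numeric)
returns void as $$
  insert into test_stat as t (sid, cnt, sum_val, min_val, max_val)
  values (i_sid, 1, i_val, i_val, i_val)
  on conflict (sid) do update set
    cnt     = t.cnt + 1,
    sum_val = t.sum_val + excluded.sum_val,
    min_val = least(t.min_val, excluded.min_val),
    max_val = greatest(t.max_val, excluded.max_val);
$$ language sql;

-- callers invoke the function instead of repeating the long merge statement
select write_stat(1, 100);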

You can use function interfaces to write data without having to type out long SQL statements.

Design of Logs Plus Real-Time Computing

If detail data must be recorded while real-time statistics are collected at the same time, you can use a RULE for this design.

For incremental writing, you can use a WHERE condition on the rule to filter out unnecessary writes.
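
A minimal sketch, assuming a log_table for the raw detail data and reusing the hypothetical stat_c1 table; the WHERE condition is only an illustrative filter:

create table log_table (
  c1       int,
  val      numeric,
  crt_time timestamp default now()
);

-- DO ALSO keeps the detail row in log_table and updates the statistics as well;
-- the WHERE condition skips rows that do not need to be counted
create rule r_stat as on insert to log_table
where new.val is not null
do also
  insert into stat_c1 as t (c1, cnt, sum_val, min_val, max_val)
  values (new.c1, 1, new.val, new.val, new.val)
  on conflict (c1) do update set
    cnt     = t.cnt + 1,
    sum_val = t.sum_val + excluded.sum_val,
    min_val = least(t.min_val, excluded.min_val),
    max_val = greatest(t.max_val, excluded.max_val);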

When this method is used, the data is written to log_table and the streaming statistics are collected at the same time.

Original Source