PostgreSQL Business Data Quality Real-time Monitoring Practices

Background

As a business system grows, its business units ingest data more and more frequently, which raises a new concern: data quality. For example, you need to determine whether fields have been omitted from upstream data, whether upstream data arrives in a timely manner, and whether the upstream data itself contains errors.

These problems can be solved through business data quality monitoring.

The built-in statistics feature in PostgreSQL can meet the requirements of most business data quality monitoring scenarios.

If a higher level of business-based and custom data quality monitoring is required, PostgreSQL also provides features such as async batch consume with atomicity, stream computing, and asynchronous messaging to support real-time data quality monitoring.

A Built-In Feature That Allows Real-Time Business Data Quality Monitoring

The following section describes the built-in statistics feature in PostgreSQL:

1. Number of real-time records

2. Real-time column statistics (proportion of null values, average column width, number of distinct values, most common values and their frequencies, a histogram of evenly distributed value bounds, correlation with the physical row order, most common elements and their frequencies, and an element count histogram)


3. Real-time table statistics (such as the number of full table scans, the number of tuples read by full table scans, the number of index scans, the number of tuples fetched by index scans, the numbers of inserted, updated, and deleted tuples, and the number of dead tuples)

4. Statistics analysis and scheduling policies

PostgreSQL automatically collects statistics based on how much a table has changed. Collection is scheduled by the autovacuum launcher and controlled by its configuration parameters.
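The original parameter listing was not preserved; the relevant postgresql.conf settings look roughly like this (values shown are illustrative defaults, not tuning recommendations):

```
# postgresql.conf — statistics collection scheduling (illustrative values)
autovacuum = on                         # enable the autovacuum launcher
autovacuum_naptime = 1s                 # how often the launcher wakes up
autovacuum_analyze_threshold = 50       # min. changed tuples before auto-ANALYZE
autovacuum_analyze_scale_factor = 0.1   # fraction of table size added to the threshold
track_counts = on                       # required for the statistics collector
```

A table is auto-analyzed once the number of changed tuples exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * table size, so smaller values mean fresher statistics at the cost of more background work.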


Business data quality can be monitored in real time with the preceding feedback information.

Example

1. Create a test table
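The original listing was not preserved; a plausible test table for this scenario (all names here are illustrative) might be:

```sql
-- Hypothetical test table: one row per ingested business record
CREATE TABLE t_quality (
    id       serial PRIMARY KEY,
    c1       int,                      -- business field that may arrive as NULL
    info     text,                     -- free-form payload
    crt_time timestamp DEFAULT now()   -- ingestion time, used for data aging
);
```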

2. Create stress testing scripts

3. Perform stress testing
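A hedged sketch of steps 2 and 3 using pgbench (the flags are standard pgbench options; the script body and table name are assumptions carried over from the sketch above):

```
# test.sql — pgbench custom script: insert rows, ~10% with a NULL field
cat > test.sql <<'EOF'
insert into t_quality (c1, info)
values (case when random() < 0.1 then null else 1 end, md5(random()::text));
EOF

# 16 clients, 16 threads, 120 seconds, prepared statements, no vacuum, per-statement latency report
pgbench -M prepared -n -r -f test.sql -c 16 -j 16 -T 120
```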

4. Create a data cleaning schedule that retains only the most recent 30 seconds of data, with the cleanup task scheduled once every 0.1 seconds.
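The schedule itself was not preserved. One way to express it (table and column names follow the earlier sketch and are assumptions) is a delete that keeps only the last 30 seconds, fired every 0.1 s by an external loop or scheduler:

```sql
-- Keep only the most recent 30 seconds of data.
-- Run every 0.1 s, e.g. via `psql -c '...'` in a shell loop or a job scheduler.
DELETE FROM t_quality WHERE crt_time < now() - interval '30 seconds';
```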

5. Monitor statistics in real time

Statistics per column
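Per-column statistics are exposed through the pg_stats system view; for example (the table name follows the earlier sketch):

```sql
-- Null fraction, average width, distinct count, most common values, etc.
SELECT attname, null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs, correlation
FROM pg_stats
WHERE tablename = 't_quality';
```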

Number of records
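An approximate real-time record count is maintained in pg_class and refreshed by ANALYZE/autovacuum (an exact count still requires COUNT(*)):

```sql
-- Estimated row count without scanning the table
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 't_quality';
```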

DML activity statistics
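Table-level activity counters live in the pg_stat_all_tables view:

```sql
-- Sequential/index scan counts, tuple I/O, and dead-tuple counts
SELECT seq_scan, seq_tup_read, idx_scan, idx_tup_fetch,
       n_tup_ins, n_tup_upd, n_tup_del, n_live_tup, n_dead_tup
FROM pg_stat_all_tables
WHERE relname = 't_quality';
```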

Data Cleaning Schedule

Since not all data needs to be retained for data quality monitoring, the following method cleans data efficiently without impacting reads and writes.

How to clean expired data efficiently by row number — non-partitioned tables: data aging practices

Around 2.63 million rows can be cleaned per second on a single instance.
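The article referenced above deletes by physical row address (ctid) in batches rather than scanning row by row; a hedged sketch of the pattern, reusing the illustrative table from earlier (batch size and retention window are arbitrary):

```sql
-- Delete one batch of expired rows addressed by ctid;
-- repeat until no rows qualify.
DELETE FROM t_quality
WHERE ctid = ANY (ARRAY(
    SELECT ctid FROM t_quality
    WHERE crt_time < now() - interval '30 seconds'
    LIMIT 10000
));
```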

How to Clean Statistics
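Counters gathered by the statistics collector can be reset with the built-in reset functions:

```sql
-- Reset counters for one table, or for the whole current database
SELECT pg_stat_reset_single_table_counters('t_quality'::regclass);
SELECT pg_stat_reset();
```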

How to Enforce Manual Statistics Collection
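Statistics collection can be forced at any time with ANALYZE, and the per-column sampling detail can be raised first for finer histograms:

```sql
-- Collect statistics for one table immediately
ANALYZE VERBOSE t_quality;

-- Optionally increase sampling detail for a column, then re-analyze it
ALTER TABLE t_quality ALTER COLUMN c1 SET STATISTICS 1000;
ANALYZE t_quality (c1);
```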

Customized and Real-Time Business Data Quality Monitoring

Use the async batch consume with atomicity method to monitor data quality in real time
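The pattern behind "async batch consume with atomicity" is a single statement that atomically dequeues a batch and processes it, typically a CTE over DELETE ... RETURNING. A hedged sketch (the queue and report table names are assumptions):

```sql
-- Atomically consume up to 1000 queued records and aggregate them
-- into a custom quality report, all in one statement.
WITH consumed AS (
    DELETE FROM t_quality_queue
    WHERE ctid = ANY (ARRAY(
        SELECT ctid FROM t_quality_queue LIMIT 1000
    ))
    RETURNING *
)
INSERT INTO t_quality_report (check_time, rows_seen, null_c1_rows)
SELECT now(), count(*), count(*) FILTER (WHERE c1 IS NULL)
FROM consumed;
```

Because the delete and the aggregation are one statement, a crash or rollback leaves the queue untouched; no record is consumed without being counted.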

Examples:

HTAP database PostgreSQL scenarios and performance tests — no. 32 (OLTP) high throughput data input/output (storage, row scanning, no indexes) — async batch consume with atomicity (JSON + functional stream computing)

HTAP database PostgreSQL scenarios and performance tests — no. 31 (OLTP) high throughput data input/output (storage, row scanning, no indexes) — async batch consume with atomicity (paralleled testing for high throughput read/write)

HTAP database PostgreSQL scenarios and performance tests — no. 27 (OLTP) IoT — FEED logs, stream computing, and async batch consume with atomicity (CTE)

PostgreSQL asynchronous message practice — feed system real-time monitoring and response like in proactive e-commerce services — minute level to millisecond level implementation
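Asynchronous messaging in PostgreSQL is LISTEN/NOTIFY; a trigger can push a millisecond-level alert the moment a bad record arrives. A sketch under the same illustrative schema (trigger syntax shown is for PostgreSQL 11+):

```sql
-- Notify a monitoring channel whenever a record with a missing field arrives
CREATE OR REPLACE FUNCTION notify_bad_record() RETURNS trigger AS $$
BEGIN
    IF NEW.c1 IS NULL THEN
        PERFORM pg_notify('data_quality', 'null c1 in row ' || NEW.id);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_quality
AFTER INSERT ON t_quality
FOR EACH ROW EXECUTE FUNCTION notify_bad_record();

-- A monitoring session subscribes with:
LISTEN data_quality;
```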


Reference: https://www.alibabacloud.com/blog/postgresql-business-data-quality-real-time-monitoring-practices_594816?spm=a2c41.12889500.0.0
