PostgreSQL Database Design for Pivot Data Analysis

By Digoal

Apart from social networking sites and e-commerce websites, people tend to visit popular audio, video, image, and text content websites the most. For web developers and publishers, content management is very important, and data pivoting is an important tool for content management. Video websites are now capable of playback on various devices, such as mobile phones, computers, TV boxes, TVs, and projectors. This means that organizations need to keep track of data including device attributes, member attributes, channel attributes, and so on.

Business Requirements

1. Generate device/user profiles

IDs, multi-dimensional tags, multi-valued column tags (for example, movies with certain directors/actors that have been watched by users within the last 7 days/one month).

Generally, there will be tens of thousands of values for multivalued columns (for example, tens of thousands of actors/movies). A device/person generally has dozens or even hundreds of attribute columns. There may be dozens of VALUEs in a single multivalued column for a device.

2. Profile pivoting

2.1. Query for the number of targets (the number of selected devices/members) based on any combination of tag conditions

2.2. Select a group according to any combination of tag conditions, and count the proportion taken by each category of certain columns of the group (count, group by, quantile, multidimensional pivoting)

The concurrency requirement is low.

3. Select target devices and members

Pagination query for IDs that satisfy any combination of tag conditions (to select devices or members that satisfy the conditions)

The concurrency requirement is low.

4. Point query (key value query) requirements (check whether user meets the selection rules according to any combination of tag conditions and user IDs)

The concurrency requirement for point query is high, which may involve tens of thousands of requests per second.

Volume Estimation

For audio and video websites, there are typically less than a million pieces of content (however, after introducing short video clips or user-generated media, there may be billions of pieces of content).

There can’t be more than 10 billion users and devices (based on the world’s population). In addition, devices will age, and there won’t be more than a billion active devices.

Depending on people’s capacity for abstract thinking, the number of tag columns may be in the hundreds. Multivalued columns (such as favorite actors, movies, and directors) may make up a larger proportion, maybe 50%.

The VALUE range of multivalued columns (such as actors, movies, and directors) is expected to be in the millions. (Favorite variety star tag of user A: Wang Han, Zhang Yu, Liu Wei)

There may be dozens of multivalued column tags, among which “recently watched movies” is generally useful. Aside from porn identifiers, I’d guess no one watches movies all day long.

Key Point Analysis

1. Storage and pivot of multivalued columns

1.1. Storage of multivalued columns

PostgreSQL supports array type multivalued tag columns. Since they belong to a data type, they have built-in common array operators and functions, such as CONTAINS, INTERSECT, SYM_DIFFERENCE, OVERLAP, APPEND, UNNEST, and type conversion. They are really convenient to use.

https://www.postgresql.org/docs/9.6/static/functions-array.html?spm=a2c41.12784757.0.0

1.2. Multi-valued column pivoting:

For example, the “user’s favorite actor” is a multivalued column — determining the favorite TOP 10 actors of users in Zhejiang Province.

For a conditional TOP N query that requires actual computing, use unnest(array), for example

2. Time and spatial (geographical) dimension analysis

Members and devices have geographic attributes. We can add them into GIS pivoting. PostgreSQL supports geometry types, indexes, operators, aggregate functions, window functions, clustering functions, and so on. It is extremely powerful in terms of GIS processing.

3. Computing workload

Pivoting requires powerful computing capability. PostgreSQL 9.6 began to support multi-core parallelism. Currently, a 100,000 RMB machine generally supports a 10 GB/s standalone throughput, which reaches the level of HTAP.

If you want better analytic computing capabilities, you can use Alibaba Cloud HybridDB for PostgreSQL, which is a multi-machine parallel (OLAP) version of PostgreSQL. It supports column-store, row-store, compression, multi-machine parallelism and other features.

4. Data import

Usually, video systems do not require highly real-time user profiles. New user profiles are generated on a daily basis. You can simply write daily generated profiles into Alibaba Cloud RDS PostgreSQL or HybridDB for PostgreSQL to provide pivot information.

Architecture Design

There are two solutions that meet the requirements of both pivoting and highly concurrent point queries.

Solution 1: Alibaba Cloud RDS PG 9.4 + HDB PG

Import the profile data into RDS PG and the HDB PG through OSS using the ETL scheduling system.

Leave point query requirements to RDS PG 9.4.

Leave pivoting requirements to HDB PG.

Solution 2: Alibaba Cloud RDS PG 10

RDS PG 10 supports multi-core parallelism. A 64-core machine processes about 10 GB of data per second (the hard disk bandwidth is not fixed, and is usually about 1.6 GB/s for a single NVME card). It is easy to estimate the time of pivoting.

If your business can deal with a longer pivoting time, you could achieve your business goals by using RDS PG 10.

In addition, RDS PG 10 supports multi-index BITMAP SCAN combined scans, without scanning the entire table. It is very efficient.

Solution 1 Verification and Performance

Create a function that produces multi-valued columns randomly

Create a function to generate the multivalued columns we need.

Profiles table

1. Alibaba Cloud RDS PostgreSQL

2. Alibaba Cloud HybridDB for PostgreSQL

Data import

1. Write test values

2. Method for dynamically writing data into the corresponding partitions

PostgreSQL 10

3. Method of directly writing data to a single table

PostgreSQL 10

About 62,000 rows/s, performance depends on the size of each single row.

According to my testing, 320 million pieces of data takes up 977 GB space. If the size of each single row is smaller, the performance will be better.

4. Write into Alibaba Cloud HybridDB for PostgreSQL

5. You can choose to export data from PG to HDB. For example

6. Most of the time, we import data into HDB or RDS PG from OSS in the production environment.

Refer to: https://www.alibabacloud.com/help/doc-detail/44461.htm?spm=a2c41.12784757.0.0

Pivot Testing (Alibaba Cloud HybridDB for PostgreSQL)

1. The number of qualifying devices and members

1.1 HybridDB for PostgreSQL (column-store) direct computing

Equivalent query:

Array (multi-value type) INTERSECT (including ANY) query:

Equivalent AND array (multi-value type) INTERSECT (including ANY) query:

Equivalent OR array (multi-value type) INTERSECT (including ANY) query:

Array (multi-value type) CONTAINS (including ALL) query:

1.2 Value estimation using statistical information

Improper use of count and offset as well as the optimization of pagination

1.3 Value estimation using HLL

Greenplum best practices — use of the value estimation plug-in HLL (and HLL fraction aggregate function optimization)

2. Pivoting of qualified data multivalued columns and common columns (TOP N)

2.1 Pivoting of multi-valued columns:

2.2 Pivoting of scalar columns:

Selection Test (Alibaba Cloud HybridDB for PostgreSQL)

3. Select qualifying device IDs and user IDs

Use the cursor to select the target group, and each page is silky smooth.

Select a group of 30,000 people:

Select a group of 58 million people:

Select a group of 60,000 people:

Key-value query test (Alibaba Cloud RDS for PostgreSQL)

4. Point query according to device ID and user ID

(Create an ID index. Here we use BLOCK NUM for testing, the actual effect is the same)

Single response 1.1 milliseconds

The data size is 977 GB, which exceeds the memory size. If the entire network data is hot data, the disk read rate reaches 11 GB/s.

Performance Indicators

Environment specifications

HybridDB for PostgreSQL specification: 48C SSD.

RDS PostgreSQL specification: 60 cores.

Performance data

1. The number of qualifying devices and members

1.1 Equivalent query:

0.5 seconds

1.2 Array (multi-value type) INTERSECT (including ANY) query:

10 seconds

1.3 Equivalent AND array (multi-value type) INTERSECT (including ANY) query:

6 seconds

1.4 Equivalent OR array (multi-value type) INTERSECT (including ANY) query:

10 seconds

1.5 Array (multi-value type) CONTAINS (including ALL) query:

10 seconds

2. Pivoting of qualified data multivalued columns and common columns (TOP N)

2.1 Pivoting of multi-valued columns:

8 seconds

2.2 Pivoting of scalar columns:

1.3 seconds

3. Select qualifying device IDs and user IDs

Using cursors, each page has the same efficiency

3.1 Scalar conditions, select a group of 30,000 people:

Total time consumption 0.9 seconds

3.2 Multi-value conditions, select a group of 58.5 million people:

Total time consumption 24 seconds

3.3 Multi-value conditions, select a group of 60,000 people:

Total time consumption 9 seconds

4. Point query according to device ID and user ID

1.1 ms

Summary

Alibaba Cloud ApsaraDB RDS for PostgreSQL and HybridDB for PostgreSQL can efficiently meet the multi-dimensional (time, space (GIS), multivalued columns, single-valued columns) data pivoting, group selection, and point query requirements.

Reference:

https://www.alibabacloud.com/blog/postgresql-database-design-for-pivot-data-analysis_594691?spm=a2c41.12784757.0.0

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store