Analyzing a 1TB Table from Any Dimension in Seconds with RDS PostgreSQL

Scenario Design

  1. Table structure

    create table tbl (
      id   int8,       -- Sequence
      tag1 int[],      -- Array
      c1   int,        -- 1–100
      c2   int,        -- 1–10000
      c3   timestamp   -- Timestamp
    );
  2. Create a function to generate random values

    -- Returns an array of $3 random values in the range $1–$2
    create or replace function gen_rand_ints(int, int, int) returns int[] as $$
      select array(
        select (random()*($2-$1))::int + $1
        from generate_series(1,$3)
      );
    $$ language sql strict;

    postgres=# select gen_rand_ints(10,25,5);
      gen_rand_ints
    ------------------
     {20,19,24,22,21}
    (1 row)
  3. Write test data

    -- Write 50 million rows whose tag arrays are drawn from a hot-spot range of 1000 values
    insert into tbl
      select id, gen_rand_ints(1,1000,10), random()*100, random()*10000, clock_timestamp()
      from generate_series(1,50000000) t(id);

    -- Write 100 million rows whose tag arrays are drawn from a non-hot-spot range of 1,000,000 values
    insert into tbl
      select id, gen_rand_ints(1,1000000,10), random()*100, random()*10000, clock_timestamp()
      from generate_series(1,100000000) t(id);
  4. The data model is as follows

    postgres=# select * from tbl limit 10;
        id    |                   tag1                    | c1 |  c2  |             c3
    ----------+-------------------------------------------+----+------+----------------------------
     38931521 | {424,448,91,420,382,657,677,60,530,503}   | 59 | 6120 | 2017-09-11 14:32:06.610512
     38931522 | {66,87,468,207,79,780,307,714,520,149}    | 44 | 7848 | 2017-09-11 14:32:06.610522
     38931523 | {99,628,798,558,415,74,863,839,522,953}   | 26 | 9032 | 2017-09-11 14:32:06.610531
     38931524 | {610,935,962,140,438,551,752,503,636,220} | 71 | 7136 | 2017-09-11 14:32:06.61054
     38931525 | {998,16,428,518,164,868,303,263,496,102}  | 82 | 9102 | 2017-09-11 14:32:06.61055
     38931526 | {175,683,749,696,637,8,599,247,942,561}   | 39 | 3796 | 2017-09-11 14:32:06.610559
     38931527 | {112,138,882,747,356,591,461,355,605,888} | 87 | 7684 | 2017-09-11 14:32:06.610568
     38931528 | {756,175,31,252,276,850,162,450,533,910}  | 15 | 1691 | 2017-09-11 14:32:06.610578
     38931529 | {917,744,416,860,306,801,240,416,937,122} | 16 | 2927 | 2017-09-11 14:32:06.610587
     38931530 | {712,623,647,317,511,519,86,267,693,116}  | 52 | 9676 | 2017-09-11 14:32:06.610596
    (10 rows)
  5. The goal: determine the TOP N elements of tag1 under arbitrary filter conditions.
  6. Analyze the table to generate optimizer statistics (histograms):
    postgres=# analyze tbl;
    ANALYZE
  7. The table size is 16 GB.

    postgres=# \dt+ tbl
                      List of relations
     Schema | Name | Type  |  Owner   | Size  | Description
    --------+------+-------+----------+-------+-------------
     public | tbl  | table | postgres | 16 GB |
    (1 row)
  8. Accurately compute the TOP N elements under a filter condition. Because the hot-spot arrays draw from only 1000 IDs with equal probability, the counts of the TOP 10 elements are all very close. A value-estimation approach may therefore not return exactly the same TOP 10, but the elements it returns will certainly be among those 1000 IDs. (The settings used to enable 32-way parallelism are sketched after the query below.)
    -- Query time with 32 parallel workers
    postgres=# select unnest(tag1) tag1, count(*) from tbl
               where c1 between 1 and 10
               group by 1 order by 2 desc limit 10;
     tag1 | count
    ------+-------
      134 | 50935
      768 | 50915
      663 | 50876
      567 | 50821
      146 | 50821
      332 | 50814
      450 | 50807
      884 | 50789
       58 | 50781
      605 | 50774
    (10 rows)
    Time: 23441.247 ms (00:23.441)

    -- Query time without parallelism (same query, same result)
    Time: 154935.686 ms (02:34.936)
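The 32-way parallel timing above assumes parallel query has been enabled for the session. A minimal sketch of the relevant settings follows; the worker count 32 matches the timing above, while the cost settings are illustrative assumptions, not values from the original benchmark:

    -- Sketch: encourage a 32-way parallel sequential scan for the aggregate.
    set max_parallel_workers_per_gather = 32;  -- workers per Gather node
    set max_parallel_workers = 32;             -- pool of parallel workers
    set parallel_setup_cost = 0;               -- make the planner favor parallelism
    set parallel_tuple_cost = 0;
    alter table tbl set (parallel_workers = 32);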
  9. Compute the sampled TOP N under the same conditions.
  10. For details on the sampling methods, refer to the end of this article. PostgreSQL ships two built-in sampling methods (SYSTEM and BERNOULLI) plus two more in the tsm_system_rows and tsm_system_time extensions, four in total.
  11. Here we use block-level (SYSTEM) sampling; note that sampled scans currently cannot be parallelized.
    -- Sample 5% of the blocks; scale counts up by 100/5 = 20
    postgres=# select unnest(tag1) tag1, count(*)*20
               from (select * from tbl TABLESAMPLE system (5)) t
               where c1 between 1 and 10
               group by 1 order by 2 desc limit 10;
     tag1 | ?column?
    ------+----------
      724 |    53380
      798 |    52680
       24 |    52640
      371 |    52480
      569 |    52400
      531 |    52280
      979 |    52160
      429 |    52140
      980 |    52080
      350 |    51920
    (10 rows)
    Time: 6887.745 ms (00:06.888)  -- 5% sample, about 7 seconds

    -- Sample 2% of the blocks; scale counts up by 100/2 = 50
    postgres=# select unnest(tag1) tag1, count(*)*50
               from (select * from tbl TABLESAMPLE system (2)) t
               where c1 between 1 and 10
               group by 1 order by 2 desc limit 10;
     tag1 | ?column?
    ------+----------
      324 |    55450
      435 |    55150
      720 |    55050
      943 |    54950
      475 |    54750
      958 |    54600
       13 |    54400
      742 |    54300
      739 |    54100
      301 |    53950
    (10 rows)
    Time: 2720.140 ms (00:02.720)  -- 2% sample, about 3 seconds
  12. A larger sample gives higher accuracy. The sampled TOP N is accurate here because the hot-spot tags are 1000 uniformly distributed values with nearly equal frequencies: each of the 50 million hot-spot rows carries 10 tags, and the filter c1 between 1 and 10 selects about 10% of rows, so every tag is expected to appear roughly 50,000,000 × 10 × 0.1 / 1000 = 50,000 times, which matches the counts above. If TOP 1000 were requested, the result would be 100% accurate, since all 1000 hot IDs would be returned. A sketch of how to sanity-check a sampled estimate follows.
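To gauge the error of a sampled estimate, you can compare it with the exact count for a single tag. The query below is a sketch: the tag value 134 is taken from the accurate result above, and the REPEATABLE clause (with an arbitrary seed of 42) makes the sample reproducible across runs:

    -- Exact vs. estimated count for one tag; @> is array containment.
    select
      (select count(*) from tbl
        where tag1 @> array[134] and c1 between 1 and 10)  as exact_count,
      (select count(*) * 20
         from (select * from tbl TABLESAMPLE system (5) REPEATABLE (42)) t
        where tag1 @> array[134] and c1 between 1 and 10)  as estimated_count;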

Large Table Example

  1. Table structure design

    create table tbl1 (
      id int8,       -- Sequence
      c1 int8,       -- Target field
      c2 int8,       -- 1–100
      c3 int8,       -- 1–100000
      c4 timestamp   -- Timestamp
    );
  2. Write test data (the five INSERTs run in parallel as background psql jobs):

    nohup psql -c "insert into tbl1 select id, 1, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
    nohup psql -c "insert into tbl1 select id, random()*(4-2)+2, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
    nohup psql -c "insert into tbl1 select id, random()*(10-5)+5, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
    nohup psql -c "insert into tbl1 select id, random()*(30-11)+11, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
    nohup psql -c "insert into tbl1 select id, random()*100000, random()*100, random()*100000, clock_timestamp() from generate_series(1,2000000000) t(id);" >/dev/null 2>&1 &
  3. Analyze the table. The resulting table size is 254 GB. (A note on why ANALYZE is so fast follows below.)

    postgres=# analyze tbl1;
    ANALYZE
    Time: 502.421 ms

    postgres=# \dt+ tbl1
                       List of relations
     Schema | Name | Type  |  Owner   |  Size  | Description
    --------+------+-------+----------+--------+-------------
     public | tbl1 | table | postgres | 254 GB |
    (1 row)
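ANALYZE completes in about half a second on a 254 GB table because it, too, samples: it reads roughly 300 × default_statistics_target rows rather than scanning the whole table. A quick check of the setting (100 by default, so about 30,000 sampled rows):

    show default_statistics_target;
    -- default_statistics_target
    -- ---------------------------
    --  100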
  4. Accurate TOP 30

    -- Query time with 32 parallel workers
    postgres=# select c1, count(*) from tbl1
               where c2 between 1 and 10
               group by 1 order by 2 desc limit 30;
     c1 |  count
    ----+----------
      1 | 49991259
      3 | 25006580
      2 | 12502559
      4 | 12498741
      9 | 10004285
      6 | 10002597
      8 |  9999530
      7 |  9999215
      5 |  5003219
     10 |  4998870
     29 |  2636193
     18 |  2635457
     13 |  2635344
     17 |  2634693
     26 |  2633965
     19 |  2633690
     28 |  2633526
     14 |  2633512
     15 |  2633363
     24 |  2633260
     20 |  2633014
     25 |  2632926
     16 |  2632779
     22 |  2632508
     27 |  2632288
     23 |  2632216
     21 |  2631443
     12 |  2631315
     11 |  1318483
     30 |  1318451
    (30 rows)
    Time: 20845.738 ms (00:20.846)

    -- Query time without parallelism (same query, same result)
    Time: 471112.827 ms (07:51.113)
  5. Sampled TOP 30 (a sketch for keeping the scale factor in sync with the sampling percentage follows this step)

    -- Sample 5% of the blocks; scale counts up by 100/5 = 20
    select c1, count(*)*20
    from (select * from tbl1 TABLESAMPLE system (5)) t
    where c2 between 1 and 10
    group by 1 order by 2 desc limit 30;
     c1 | ?column?
    ----+----------
      1 | 50068840
      3 | 25108820
      2 | 12558680
      4 | 12513080
      7 | 10009300
      9 | 10006260
      6 | 10005400
      8 |  9987220
      5 |  5008280
     10 |  5007980
     17 |  2652940
     16 |  2648640
     25 |  2646800
     28 |  2646600
     15 |  2642480
     20 |  2642220
     14 |  2641620
     26 |  2640500
     23 |  2639420
     29 |  2637740
     22 |  2637320
     13 |  2636900
     19 |  2636100
     18 |  2635120
     24 |  2634440
     12 |  2631480
     27 |  2629880
     21 |  2624940
     11 |  1330140
     30 |  1316480
    (30 rows)
    Time: 31884.725 ms (00:31.885)  -- 5% sample, about 32 seconds

    -- Sample 2% of the blocks; scale counts up by 100/2 = 50
    select c1, count(*)*50
    from (select * from tbl1 TABLESAMPLE system (2)) t
    where c2 between 1 and 10
    group by 1 order by 2 desc limit 30;
     c1 | ?column?
    ----+----------
      1 | 50173200
      3 | 24993550
      2 | 12487100
      4 | 12474100
      6 |  9998250
      8 |  9980450
      7 |  9973950
      9 |  9960450
     10 |  4999050
      5 |  4995000
     29 |  2642700
     28 |  2640900
     16 |  2640300
     26 |  2630250
     24 |  2627500
     23 |  2623700
     19 |  2622350
     27 |  2622000
     18 |  2621200
     12 |  2619450
     20 |  2616200
     17 |  2616050
     21 |  2615800
     15 |  2613200
     22 |  2612200
     14 |  2607700
     13 |  2605900
     25 |  2604150
     30 |  1312300
     11 |  1311950
    (30 rows)
    Time: 12942.455 ms (00:12.942)  -- 2% sample, about 13 seconds

    -- Sample 0.1% of the blocks; scale counts up by 100/0.1 = 1000
    postgres=# select c1, count(*)*1000
    from (select * from tbl1 TABLESAMPLE system (0.1)) t
    where c2 between 1 and 10
    group by 1 order by 2 desc limit 30;
     c1 | ?column?
    ----+----------
      1 | 48077000
      3 | 25061000
      2 | 12762000
      4 | 12262000
      8 |  9851000
      6 |  9789000
      7 |  9718000
      9 |  9654000
      5 |  4971000
     10 |  4885000
     18 |  2731000
     28 |  2727000
     29 |  2710000
     23 |  2697000
     15 |  2687000
     27 |  2681000
     22 |  2672000
     17 |  2672000
     25 |  2670000
     19 |  2637000
     20 |  2632000
     12 |  2628000
     14 |  2628000
     21 |  2622000
     26 |  2618000
     13 |  2601000
     24 |  2522000
     16 |  2513000
     11 |  1406000
     30 |  1301000
    (30 rows)
    Time: 863.604 ms  -- 0.1% sample, about 0.86 seconds
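One practical caveat with the queries above: the scale factor is maintained by hand and can silently drift from the sampling percentage. A small sketch using a psql variable (the name pct is a hypothetical choice) keeps both in one place:

    -- Define the sampling percentage once and derive the scale factor from it.
    \set pct 0.1
    select c1, round(count(*) * (100.0 / :pct)) as est_count
    from tbl1 TABLESAMPLE system (:pct)
    where c2 between 1 and 10
    group by 1 order by 2 desc limit 30;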
  6. When sampling one-thousandth of the data (only about 254 MB is scanned), obtaining a highly accurate TOP 30 takes less than one second.
  7. Combined with Greenplum, this technique would make it possible to pivot a trillion rows along any dimension within seconds.

Summary

Comparison of query times: sampling vs. accurate query (figures from the tests above)

  Table          | Accurate, 32-way parallel | Accurate, serial | 5% sample | 2% sample | 0.1% sample
  ---------------+---------------------------+------------------+-----------+-----------+-------------
  16 GB, TOP 10  | 23.4 s                    | 154.9 s          | 6.9 s     | 2.7 s     | -
  254 GB, TOP 30 | 20.8 s                    | 471.1 s          | 31.9 s    | 12.9 s    | 0.86 s

Sampling-based computation achieves high accuracy while consuming very few resources: the 0.1% sample above scans roughly 1/1000 of the 254 GB table yet still ranks the TOP 30 almost perfectly.

Efficiency evaluation of value estimation

Sampling methods

The generic syntax, which appears in the FROM clause of a SELECT, is:

    TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
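For reference, a brief sketch of all four methods in use. SYSTEM and BERNOULLI are built in; system_rows and system_time are provided by the tsm_system_rows and tsm_system_time extensions shipped in contrib:

    -- Built-in: SYSTEM samples whole blocks (fastest, slightly clustered);
    -- BERNOULLI samples individual rows (slower, more uniform).
    select count(*) from tbl TABLESAMPLE system (1);
    select count(*) from tbl TABLESAMPLE bernoulli (1);

    -- Extensions: sample a fixed number of rows, or within a time budget (ms).
    create extension if not exists tsm_system_rows;
    select count(*) from tbl TABLESAMPLE system_rows (100000);

    create extension if not exists tsm_system_time;
    select count(*) from tbl TABLESAMPLE system_time (1000);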

