Performance Evaluation Methods for Distributed MPP Databases — Best Practice for PostgreSQL

Image for post
Image for post

Background

Evaluating the performance of a database usually involves either industry standard testing or or modeling testing based on the business model.

What Are the Hardware Indicators That Have the Most Influence on the Performance of a Database?

These indicators have the biggest influence on the performance of a database:

64*2*2400/8/1024= 37.5 GB/s
dmidecode --type 17

Array Handle: 0x0034
Error Information Handle: Not Provided
Total Width: 72 bits ## 带ECC, 64+8
Data Width: 72 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: CPU0_A0
Bank Locator: NODE 1
Type: DDR4
Type Detail:
Speed: 2400 MHz
Manufacturer:
Serial Number:
Asset Tag:
Part Number:
Rank: 2
Configured Clock Speed: 2133 MHz
Memory speed    
#dd if=/dev/zero of=/dev/null bs=4k count=1024000000
^C68517474+0 records in
68517473+0 records out
280647569408 bytes (281 GB) copied, 34.1855 s, 8.2 GB/s

Block device speed
#dd if=/dev/Block device name of=/dev/null bs=4k count=102300000
^C2687957+0 records in
2687956+0 records out
11009867776 bytes (11 GB) copied, 4.6525 s, 2.4 GB/s

Example of Greenplum performance evaluation

We can simplify the evaluation model, as the CPU has significant effect (e.g. LLVM, vector optimization, or other optimizations) on the results. Here, I choose to ignore the deviation introduced by the CPU. We will not take into account data skew either.

1. Environment

We will discuss how to conduct a performance evaluation with the following environment as an example.

2. Performance indicators

Performance indicators obtained in another environment

postgres=# create table mmtest(id int) with (appendonly=true, blocksize=2097152, ORIENTATION=COLUMN);  
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE

postgres=# insert into mmtest select generate_series(1,100000);
INSERT 0 100000
insert into mmtest select * from mmtest ;
...
postgres=# insert into mmtest select * from mmtest ;
INSERT 0 409600000

postgres=# select pg_size_pretty(pg_total_relation_size('mmtest'));
pg_size_pretty
----------------
3133 MB
(1 row)

postgres=# select count(*) from mmtest ;
count
-----------
819200000
(1 row)

Time: 779.444 ms

postgres=# select * from mmtest where id=0;
id
----
(0 rows)

Time: 422.538 ms
postgres=# create table mmtest1(id int)  
postgres-# ;
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
Time: 273.659 ms
postgres=# insert into mmtest1 select * from mmtest;

postgres=# select pg_size_pretty(pg_total_relation_size('mmtest1'));
pg_size_pretty
----------------
28 GB
(1 row)

postgres=# select count(*) from mmtest1 ;
count
-----------
819200000
(1 row)

Time: 1171.229 ms

postgres=# select * from mmtest1 where id=0;
id
----
(0 rows)
Time: 452.582 ms
create unlogged table mmtest(id int);  
postgres=# insert into mmtest select generate_series(1,100000);
INSERT 0 100000
insert into mmtest select * from mmtest ;
...
postgres=# insert into mmtest select * from mmtest ;
INSERT 0 409600000

postgres=# select pg_size_pretty(pg_total_relation_size('mmtest'));
pg_size_pretty
----------------
28 GB
(1 row)

postgres=# select * from mmtest where id=0;
id
----
(0 rows)

Time: 56410.222 ms (00:56.410)

32 Parallel computing
3.02 seconds
Image for post
Image for post

3. Query performance evaluation

a. Data scanning time

4. Data import performance evaluation

a. Insert a single commit

5.Data redistribution performance evaluation

Data redistribution time evaluation

6. Data vacuum full (redistribute) performance evaluation

a. vacuum full

References:

Verification of optimizer cost factors — aligning cost constants to timestamps in PostgreSQL

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store