Behavior and Audit Log Modeling: PostgreSQL Best Practice (1)

Alibaba Cloud
14 min readJun 5, 2019

By Digoal

Many systems retain users’ behavior logs, which include information like browsing behaviors, social behaviors, and operation behaviors. Typical application scenarios include SQL database auditing and behavior auditing of bastion hosts within enterprises.

The number of behavior logs and audit logs is related to the business volume and the number of operations. To meet the real-time query requirements of enterprises, it is usually necessary to build search engines, for example, by using the full-text search feature in ES or PostgreSQL.

Using PostgreSQL to build search engines has some advantages and can meet several requirements:

1. Store details. In addition to fields that need indexing, the Details field can also be stored in PostgreSQL.

2. Create an index, that is, create a full-text index in the log behavior field.

3. Create multidimensional indexes. In addition to an index in the log behavior field, other fields can also be indexed, such as indexes for the time dimension and the property dimension. These indexes can be combined to meet multidimensional search requirements.

4. Search for information in real time. Synchronization with search engines is not required.

Example

Testing Machine

Use a large and inexpensive SATA disk and use an SSD as a BCACHE cache.

12 * N TB

Create a directory in each disk.

/disk[1-12]

Operating System Configuration

Skipped

Compile PostgreSQL

wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2          

tar -jxvf postgresql-snapshot.tar.bz2
cd postgresql-10beta1

export USE_NAMED_POSIX_SEMAPHORES=1
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10_8k --with-blocksize=8 --with-wal-segsize=1024
LIBS=-lpthread CFLAGS="-O3" make world -j 128
LIBS=-lpthread CFLAGS="-O3" make install-world

Environment Variables

vi env_pg10.sh           

export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=$(($1+1920))
export PGDATA=/disk$1/digoal/pgdata/pg_root$(($1+1920))
export LANG=en_US.utf8
export PGHOME=/home/digoal/pgsql10_8k
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export LD_RUN_PATH=$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=127.0.0.1
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi

Initialize Clusters

Create 12 database clusters corresponding to each disk to fully utilize the disk I/O.
vi init.sh

for ((i=1;i<13;i++))          
do
. ~/env_pg10.sh $i
initdb -D $PGDATA -E SQL_ASCII --locale=C -U postgres
echo "local all all trust" > $PGDATA/pg_hba.conf
echo "host all all 127.0.0.1/32 trust" >> $PGDATA/pg_hba.conf
echo "host all all ::1/128 trust" >> $PGDATA/pg_hba.conf
echo "host all all 0.0.0.0/0 trust" >> $PGDATA/pg_hba.conf
done
. ./init.sh

Configure Parameters

postgresql.auto.conf          

# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
listen_addresses = '0.0.0.0'
max_connections = 400
unix_socket_directories = '.'
shared_buffers = 32GB
maintenance_work_mem = 2GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 500
bgwriter_lru_multiplier = 5.0
bgwriter_flush_after = 0
max_parallel_workers_per_gather = 0
fsync=on
synchronous_commit = off
full_page_writes = off
wal_buffers = 128MB
wal_writer_delay = 2s
wal_writer_flush_after = 0
checkpoint_timeout = 10min
max_wal_size = 32GB
min_wal_size = 4GB
checkpoint_completion_target = 0.5
checkpoint_flush_after = 128kB
max_wal_senders = 0
random_page_cost = 1.5
effective_cache_size = 48GB
log_destination = 'csvlog'
logging_collector = on
log_truncate_on_rotation = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_freeze_max_age = 500000000
autovacuum_multixact_freeze_max_age = 600000000
vacuum_freeze_table_age = 400000000
vacuum_multixact_freeze_table_age = 400000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
autovacuum_work_mem = 2GB
autovacuum_max_workers = 6
autovacuum_naptime = 20s
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.2
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = 0
zero_damaged_pages = on
cp postgresql.auto.conf /disk1/digoal/pgdata/pg_root1921/
...
...
cp postgresql.auto.conf /disk12/digoal/pgdata/pg_root1932/

Start Clusters

Bind database instances to different CPU cores

vi start.sh          

for ((i=1;i<13;i++))
do
. /home/digoal/env_pg10.sh $i

if [ $i -eq 12 ]; then
numactl --physcpubind=56-63 pg_ctl start -o "-c port=$PGPORT"
else
numactl --physcpubind="$((($i-1)*(5)+(1)))"-"$(($i*5))" pg_ctl start -o "-c port=$PGPORT"
fi

done
. ./start.sh

Stop Clusters

vi stop.sh           

for ((i=1;i<13;i++))
do
. /home/digoal/env_pg10.sh $i

pg_ctl stop -m fast

done
. ./stop.sh

Modeling

Table Structure

Four fields: Storage PK (corresponding to original detail data PKs), Time, User ID, and User Behavior (tsvector field).

Retrieval Requirements

Data can be retrieved by time interval, user ID, and word-breaking criteria.

Retention Period

A time period after which retained logs will be cleaned (for example, one month).

Partition

Create multiple partitioned tables in each cluster (for example, 12 partitioned tables are used in this example).

If practical, we recommend that you create a partitioned table every hour so that you can query data within a specific time interval simply by using the partitions, without having to create time indexes.

If a single user has a very large amount of data, we recommend that you also create hash or LIST partitions by UID in order to query by UID instead of using indexes. (This eliminates the need to create an index in the UID field or even the need to store the UID field itself.)

Indexes

Use full-text indexes in the behavior field.

Use B-tree indexes in the user ID field.

Use BRIN indexes in the time field.

Simulate Data

Time (generated in a time series).

User ID (randomly generated in a specific range)

User behavior data (strings about 512 characters long, split into several tokens, for example, 40 tokens of different lengths in this example).

Initialize Data Tables

vi init.sql          

drop table bptest cascade;
do language plpgsql
$$

declare
begin
for i in 1..12 loop
execute 'drop table bptest'||i||' cascade';
end loop;
end;
$$
;

create table bptest(pk serial8, uid int, ts int8, content tsvector);
create index idx_bptest_content on bptest using gin (content) with (fastupdate=on,gin_pending_list_limit=2048);
create index idx_bptest_uid on bptest(uid);
create index idx_bptest_ts on bptest using brin(ts);

do language plpgsql
$$

declare
begin
for i in 1..12 loop
execute 'create table bptest'||i||'(like bptest including all) inherits(bptest)';
end loop;
end;
$$
;

-- Generate random strings of arbitrary character length
CREATE OR REPLACE FUNCTION public.gen_rand_str(integer)
RETURNS text
LANGUAGE sql
STRICT
AS $function$
select string_agg(a[random()*6+1],'') from generate_series(1,$1), (select array['a','b','c','d','e','f',' ']) t(a);
$function$;
psql -f ./init.sql -p 1921
...
...
psql -f ./init.sql -p 1932

Test

Pour the test data, for example, insert 200 million pieces of data into each table and 2.4 billion pieces of data into each database (around 6 TB). The total amount of inserted data is 28.8 billion pieces (around 72 TB).

Pour 10 pieces of data in each batch.

vi test.sh          

for ((i=1;i<13;i++))
do
echo "insert into bptest${i} (uid,ts,content) select random()*100000 , extract(epoch from now()), to_tsvector('english',gen_rand_str(512)) from generate_series(1,10);" > ~/test${i}.sql
done

for ((i=1;i<13;i++))
do
. /home/digoal/env_pg10.sh ${i}

for ((x=1;x<13;x++))
do
nohup pgbench -M prepared -n -r -P 3 -f ./test${x}.sql -c 1 -j 1 -t 200000000 >/tmp/bptest_${i}_${x}.log 2>&1 &
done
done


chmod 500 test.sh

The query test data is as follows. The test data is highly random, and the content of each record is about 40 elements, no longer than 512 characters in length.

select * from bptest1 limit 1;          

pk | 1
uid | 849185
ts | 1494928859
content | 'aaaefba':14 'acddcfd':39 'acdeeaadbffdbbecceb':50 'aceeedfbaefbdfcbd':59 'adbbeddbecfdcffaeedcedaeeddaeaaeebfbdcdcecfbbebfcebabceffbfdbfbfa':60
'adcdf':61 'aead':47 'afddf':70 'ba':8 'bae':37 'bbaacffbcafeffafefdf':38 'bbe':55 'bbecfdf':32 'bcbfd':27 'bdce':45 'bdeccbcdeaabefbeeebcdbfddd':19
'bed':17 'beedeadccbbbecbfcbf':44 'bfccaeddaddbc':2 'cafdfcf':5 'cbcacefaff':3 'cbcfc':52 'cbfef':63 'ccdcbedb':33 'ccdcd':20 'cd':6 'cfecfeeccabf':42
'cffb':15 'dabdfddeeabfdcefb':16 'dacdeecfbcefebfabeedfabbaccec':57 'daee':1 'daffcdffadddbaffd':68 'dbcddacefcd':9
'dbdbcbfadfffbdddaaabdcbcecdbecbbdecffbfcfecbbfebfebcadefecfceadaeffd':11 'dcdf':23 'dd':53 'ddec':31 'debdcdebfffebdbfdeefffbcfbccbececdbeaffffedfbefdcccbbccadedecfbeccccbbb':48
'deefaeeaabdbbdafcfcbeecc':71 'df':26 'dfcbbcd':46 'e':7,51,56 'eafddcaac':43 'ecbaffa':21 'ecdeeceddbdcbfcabdc':10 'ecedcec':41 'ed':66
'edcbaecfcdfbcbcdedeebdbfceeeececfac':35 'eeca':25 'eeebafeacfebfdbdbddaacabebabbfbfdefeddefccfbeaefdbf':29 'eefdbfcadebcbbfffaefcaecafbddbdbfcf':13
'ef':58 'efbdc':67 'efccdddaebfbdaffcdfcbfdcbdeb':54 'efccebdddededdeda':64 'effcbfdfdeebfbbcfaabfd':12 'f':24,28,65 'fbbfccfcbcba':30 'fc':4
'fcbbdbbaefcefefdf':34 'fd':18 'fdffcbe':69 'fea':62 'feeabdcd':36 'feeadcedecedebaedccffbfddadcfececbefddcbeaedbebfadefedcbd':22 'feffceceaeec':49 'ffaffde':40

Launch a full-text retrieval request, enter four query criteria, and return PKs by stream.

We recommend that you use the streaming return interface because the result set may be very large.

select pk from bptest1 where uid=$1 and ts between $2 and $3 and content @@ to_tsquery('english', $4);

Stress testing

./test.sh

Resource Usage

dstat

The CPU is basically exhausted and the disks are substantially occupied after data is written.

User data consumes most of the CPU overhead (you can use perf to check the performance later).

dstat          

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
92 7 1 0 0 1|1075M 2435M|4048B 2297B| 0 0 | 142k 167k
92 7 0 0 0 1|1137M 2075M|2391B 1945B| 0 0 | 135k 161k
91 8 0 0 0 0|1182M 2125M|3483B 2845B| 0 0 | 140k 166k
91 8 0 0 0 1|1193M 1971M|3788B 1633B| 0 0 | 135k 159k
91 8 0 0 0 0|1089M 2305M|2232B 1478B| 0 0 | 139k 159k
92 7 0 0 0 1| 986M 2795M|2176B 1568B| 0 0 | 127k 142k
92 7 1 0 0 0| 760M 2864M|6028B 1408B| 0 0 | 116k 118k
90 8 0 0 0 0|1029M 3057M|1565B 2116B| 0 0 | 132k 150k
90 9 1 0 0 1|1000M 3237M|2336B 4850B| 0 0 | 133k 154k
90 8 1 1 0 1| 659M 4399M|2872B 7992B| 0 0 | 115k 119k
91 7 0 0 0 1| 925M 2996M|1293B 1059B| 0 0 | 122k 127k
90 8 1 1 0 1| 996M 3350M| 664B 574B| 0 0 | 133k 148k
91 7 1 0 0 1| 948M 2927M|3525B 2500B| 0 0 | 132k 146k
90 8 0 1 0 0|1114M 2869M|1751B 2645B| 0 0 | 132k 150k
90 8 0 1 0 1|1267M 2408M|3003B 2244B| 0 0 | 137k 167k
91 8 0 1 0 1|1086M 2539M| 900B 347B| 0 0 | 133k 154k
91 8 0 0 0 1| 998M 2614M|1975B 1757B| 0 0 | 130k 151k
91 8 0 0 0 0|1120M 2150M|1466B 4911B| 0 0 | 130k 154k
92 7 0 0 0 0|1163M 2387M|1356B 498B| 0 0 | 136k 163k
90 8 1 1 0 1| 864M 2656M|2601B 3373B| 0 0 | 130k 143k
91 8 0 0 0 1| 987M 2651M|2052B 898B| 0 0 | 135k 154k
91 8 0 0 0 0|1073M 2205M|2479B 2319B| 0 0 | 130k 144k
90 8 1 1 0 1| 951M 2941M|1390B 1001B| 0 0 | 130k 148k

Disk Usage

iostat -x           

avg-cpu: %user %nice %system %iowait %steal %idle
85.29 0.49 9.89 1.90 0.00 2.43

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 92.40 99.40 389.00 2822.40 136435.20 285.13 2.25 4.60 0.65 31.64
sdc 0.00 33.60 154.20 211.60 4838.40 85700.80 247.51 1.14 3.10 0.63 22.96
sdd 0.00 63.00 232.40 238.40 7316.80 109648.00 248.44 2.17 4.61 0.99 46.54
sde 0.00 78.80 269.60 340.80 7980.80 102419.20 180.87 2.53 4.14 0.94 57.62
sdf 0.00 58.40 283.00 234.20 8204.80 99129.60 207.53 2.30 4.45 0.93 48.20
sdg 0.00 50.80 207.60 236.60 6652.80 94337.60 227.35 1.42 3.19 0.68 30.34
sdh 0.00 102.20 109.40 475.20 3489.60 131211.20 230.42 2.60 4.45 0.52 30.40
sdi 0.00 70.20 107.00 337.00 3228.80 79603.20 186.56 1.35 3.04 0.53 23.38
sdj 0.00 31.00 70.60 158.80 2534.40 85124.80 382.12 0.82 3.59 0.86 19.72
sdk 0.20 58.40 190.60 295.80 5587.20 123539.20 265.47 1.74 3.57 0.68 33.28
sdl 0.00 91.00 162.80 396.40 4441.60 119507.20 221.65 1.98 3.54 0.59 33.26
sdm 0.00 274.80 103.20 359.20 2296.00 158908.80 348.63 3.81 8.23 1.08 50.06

perf

The postgres process consumes most of the overhead.

top -ag          

PerfTop: 9171 irqs/sec kernel:63.7% exact: 0.0% [1000Hz cycles], (all, 64 CPUs)
----------------------------------------------------------------------------------------------

samples pcnt function DSO
_______ _____ ___________________________ _______________________________________

23044.00 4.5% tsCompareString /home/digoal/pgsql10_8k/bin/postgres
19821.00 3.9% ExecInterpExpr /home/digoal/pgsql10_8k/bin/postgres
12258.00 2.4% gintuple_get_key /home/digoal/pgsql10_8k/bin/postgres
12208.00 2.4% pg_detoast_datum_packed /home/digoal/pgsql10_8k/bin/postgres
11111.00 2.2% hash_search_with_hash_value /home/digoal/pgsql10_8k/bin/postgres
10318.00 2.0% memcpy /lib64/libc-2.12.so
9078.00 1.8% AllocSetAlloc /home/digoal/pgsql10_8k/bin/postgres
8944.00 1.7% advance_aggregates /home/digoal/pgsql10_8k/bin/postgres
8547.00 1.7% cmpEntryAccumulator /home/digoal/pgsql10_8k/bin/postgres
7311.00 1.4% array_seek /home/digoal/pgsql10_8k/bin/postgres
6744.00 1.3% gin_cmp_tslexeme /home/digoal/pgsql10_8k/bin/postgres
6650.00 1.3% __closure_wake_up [bcache]
6550.00 1.3% appendBinaryStringInfo /home/digoal/pgsql10_8k/bin/postgres
6475.00 1.3% TParserGet /home/digoal/pgsql10_8k/bin/postgres
5578.00 1.1% ginFindLeafPage /home/digoal/pgsql10_8k/bin/postgres
5543.00 1.1% PyParser_AddToken /lib64/libpython2.7.so.1.0
5412.00 1.1% array_get_element /home/digoal/pgsql10_8k/bin/postgres
5355.00 1.0% heap_fill_tuple /home/digoal/pgsql10_8k/bin/postgres
4936.00 1.0% entryLocateLeafEntry /home/digoal/pgsql10_8k/bin/postgres
4732.00 0.9% heap_form_minimal_tuple /home/digoal/pgsql10_8k/bin/postgres
4512.00 0.9% rb_insert /home/digoal/pgsql10_8k/bin/postgres

top

top -c -u digoal          

top - 19:20:47 up 179 days, 5:38, 8 users, load average: 183.79, 189.01, 166.41
Tasks: 2939 total, 159 running, 2780 sleeping, 0 stopped, 0 zombie
Cpu(s): 87.7%us, 8.9%sy, 0.9%ni, 0.9%id, 1.2%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 529321828k total, 512395020k used, 16926808k free, 299780k buffers
Swap: 0k total, 0k used, 0k free, 482162560k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11258 digoal 20 0 20.7g 2.9g 2.9g R 78.7 0.6 5:11.98 postgres: postgres postgres 127.0.0.1(52848) INSERT
11253 digoal 20 0 20.7g 3.0g 2.9g R 70.7 0.6 5:14.48 postgres: postgres postgres 127.0.0.1(52843) INSERT
11264 digoal 20 0 20.7g 3.0g 2.9g R 68.1 0.6 5:14.63 postgres: postgres postgres 127.0.0.1(52854) INSERT
11263 digoal 20 0 20.7g 3.0g 2.9g R 65.2 0.6 5:14.89 postgres: postgres postgres 127.0.0.1(52853) INSERT
11250 digoal 20 0 20.7g 3.0g 3.0g R 59.4 0.6 5:16.66 postgres: postgres postgres 127.0.0.1(52840) idle
11268 digoal 20 0 20.7g 2.9g 2.9g R 53.3 0.6 5:11.36 postgres: postgres postgres 127.0.0.1(52858) INSERT
11266 digoal 20 0 20.7g 3.0g 3.0g R 52.3 0.6 5:09.00 postgres: postgres postgres 127.0.0.1(52856) INSERT
11068 digoal 20 0 20.7g 2.6g 2.6g R 51.4 0.5 4:44.47 postgres: postgres postgres 127.0.0.1(45823) INSERT
11097 digoal 20 0 20.7g 2.6g 2.5g R 49.5 0.5 4:47.85 postgres: postgres postgres 127.0.0.1(45839) INSERT
11161 digoal 20 0 20.7g 2.6g 2.6g R 49.5 0.5 4:47.87 postgres: postgres postgres 127.0.0.1(44881) INSERT
11256 digoal 20 0 20.7g 3.0g 3.0g R 49.5 0.6 5:14.69 postgres: postgres postgres 127.0.0.1(52846) INSERT
10819 digoal 20 0 20.7g 2.7g 2.7g R 48.8 0.5 4:58.17 postgres: postgres postgres 127.0.0.1(47342) INSERT
11107 digoal 20 0 20.7g 2.7g 2.7g R 48.8 0.5 5:02.00 postgres: postgres postgres 127.0.0.1(59612) INSERT
11255 digoal 20 0 20.7g 3.0g 3.0g R 48.2 0.6 5:15.68 postgres: postgres postgres 127.0.0.1(52845) INSERT
11267 digoal 20 0 20.7g 3.0g 3.0g R 47.9 0.6 5:18.82 postgres: postgres postgres 127.0.0.1(52857) INSERT
11066 digoal 20 0 20.7g 2.6g 2.6g R 46.9 0.5 4:44.97 postgres: postgres postgres 127.0.0.1(45821) INSERT
11222 digoal 20 0 20.7g 2.6g 2.6g R 45.9 0.5 5:00.43 postgres: postgres postgres 127.0.0.1(40379) idle
11207 digoal 20 0 20.7g 2.6g 2.6g R 45.6 0.5 5:04.59 postgres: postgres postgres 127.0.0.1(46160) INSERT
11224 digoal 20 0 20.7g 2.6g 2.6g R 45.3 0.5 5:02.60 postgres: postgres postgres 127.0.0.1(40381) INSERT
11249 digoal 20 0 20.7g 2.6g 2.6g R 45.3 0.5 4:59.58 postgres: postgres postgres 127.0.0.1(46187) INSERT
11045 digoal 20 0 20.7g 2.6g 2.6g R 44.6 0.5 4:39.75 postgres: postgres postgres 127.0.0.1(64424) idle
11064 digoal 20 0 20.7g 2.6g 2.6g R 44.6 0.5 4:44.69 postgres: postgres postgres 127.0.0.1(45819) INSERT
11145 digoal 20 0 20.7g 2.6g 2.6g S 44.3 0.5 4:46.18 postgres: postgres postgres 127.0.0.1(44876) INSERT
10865 digoal 20 0 20.7g 2.6g 2.6g R 44.0 0.5 4:59.89 postgres: postgres postgres 127.0.0.1(49769) INSERT
11080 digoal 20 0 20.7g 2.6g 2.5g R 44.0 0.5 4:43.70 postgres: postgres postgres 127.0.0.1(45825) INSERT
11247 digoal 20 0 20.7g 2.6g 2.6g R 43.4 0.5 5:01.91 postgres: postgres postgres 127.0.0.1(40391) idle
11163 digoal 20 0 20.7g 2.6g 2.6g R 42.7 0.5 4:48.34 postgres: postgres postgres 127.0.0.1(44882) idle
11164 digoal 20 0 20.7g 2.6g 2.6g R 42.4 0.5 4:53.21 postgres: postgres postgres 127.0.0.1(44883) INSERT
10882 digoal 20 0 20.7g 2.6g 2.6g R 41.8 0.5 5:04.78 postgres: postgres postgres 127.0.0.1(49772) INSERT
10868 digoal 20 0 20.7g 2.6g 2.6g R 41.4 0.5 5:00.30 postgres: postgres postgres 127.0.0.1(49770) INSERT

TPS of Data Writing

The performance of writing data is about 65,000 rows per second on a single machine.

progress: 729.0 s, 55.0 tps, lat 11.610 ms stddev 4.836          
progress: 732.0 s, 59.7 tps, lat 20.071 ms stddev 107.984
progress: 735.0 s, 57.0 tps, lat 20.492 ms stddev 125.445
progress: 738.0 s, 38.7 tps, lat 25.891 ms stddev 154.607
progress: 741.0 s, 41.0 tps, lat 24.405 ms stddev 140.247
progress: 744.0 s, 43.0 tps, lat 13.550 ms stddev 10.448
progress: 747.0 s, 60.0 tps, lat 20.691 ms stddev 131.640
progress: 750.0 s, 60.0 tps, lat 17.394 ms stddev 83.385
progress: 753.0 s, 44.3 tps, lat 25.510 ms stddev 146.719
progress: 756.0 s, 25.0 tps, lat 39.819 ms stddev 213.642
progress: 759.0 s, 50.0 tps, lat 11.439 ms stddev 5.319
progress: 762.0 s, 60.0 tps, lat 20.979 ms stddev 106.782
progress: 765.0 s, 60.0 tps, lat 18.778 ms stddev 167.714
progress: 768.0 s, 58.0 tps, lat 18.017 ms stddev 99.949
progress: 771.0 s, 51.0 tps, lat 19.636 ms stddev 124.429

The performance of writing data basically depends on the number of elements in the tsvector field and hash. Each record has 40 elements in this example. If the number of elements is reduced by 50%, the performance will be about twice as good.

postgres=# select array_length(tsvector_to_array(content),1) from bptest1 limit 10;        
array_length
--------------
40
37
40
45
35
42
38
46
30
40
(10 rows)

Evaluate How Many Index Entries Are Created per Second

1. Full-text search index entries

Each record has about 40 elements. Around 2.6 million index entries will be created per second in the event of 65,000 TPS.

65,000*40 = 2,600,000

2. The number of UID index entries is negligible.

3. TS index entries use a BRIN index and are negligible.

Creating full-text index entries has the most significant impact on the performance and consumes the most resources.

Query Performance

Example

postgres=# \dt+ bptest1      
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+----------+---------+-------------
public | bptest1 | table | postgres | 1689 MB |
(1 row)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from bptest1 where ts between 1494999617 and 1495999617 and content @@ to_tsquery ('english','abc');
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.bptest1 (cost=175.95..23691.41 rows=20015 width=811) (actual time=7.017..23.376 rows=19755 loops=1)
Output: uid, ts, content
Recheck Cond: (bptest1.content @@ '''abc'''::tsquery)
Filter: ((bptest1.ts >= 1494999617) AND (bptest1.ts <= 1495999617))
Heap Blocks: exact=18933
Buffers: shared hit=18948
-> Bitmap Index Scan on bptest1_content_idx (cost=0.00..170.94 rows=20019 width=0) (actual time=3.811..3.811 rows=19755 loops=1)
Index Cond: (bptest1.content @@ '''abc'''::tsquery)
Buffers: shared hit=15
Planning time: 0.097 ms
Execution time: 24.517 ms
(11 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from bptest1 where ts between 1494999617 and 1495999617 and content @@ to_tsquery ('english','abc & bc');
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.bptest1 (cost=36.27..2598.42 rows=1996 width=811) (actual time=4.577..6.711 rows=2125 loops=1)
Output: uid, ts, content
Recheck Cond: (bptest1.content @@ '''abc'' & ''bc'''::tsquery)
Filter: ((bptest1.ts >= 1494999617) AND (bptest1.ts <= 1495999617))
Heap Blocks: exact=2116
Buffers: shared hit=2239
-> Bitmap Index Scan on bptest1_content_idx (cost=0.00..35.77 rows=1997 width=0) (actual time=4.291..4.291 rows=2125 loops=1)
Index Cond: (bptest1.content @@ '''abc'' & ''bc'''::tsquery)
Buffers: shared hit=123
Planning time: 0.125 ms
Execution time: 6.849 ms
(11 rows)

Performance of Data Writing to SSD with fsync=on

  1. Data writing testing per second

Up to 70,000 records are written per second, and around 2.8 million full-text index entries are created per second.

The performance is relatively smooth.

Performance of Data Writing to SATA and SSD Bcache with fsync=off

  1. Data writing testing per second

Up to 75,000 records are written per second, and around 3 million full-text index entries are created per second.

The performance is relatively smooth.

Summary

1. Query aggregation

We recommend that you use plproxy to aggregate queries because log data is distributed among multiple clusters and tables.

2. Write to shards

Writing to shards can be completed in the business layer to write data in a scrambled manner.

In actual application scenarios, you can create more partitions as needed.

3. The majority of the overhead is incurred on Postgres. To perform detailed analysis, re-compile Postgres

4. GIN index optimization

https://www.postgresql.org/docs/9.6/static/sql-createindex.html

GIN indexes accept different parameters:          

1. fastupdate

This setting controls usage of the fast update technique described in Section 63.4.1.
It is a Boolean parameter: ON enables fast update, OFF disables it.
(Alternative spellings of ON and OFF are allowed as described in Section 19.1.) The default is ON.

Note: Turning fastupdate off via ALTER INDEX prevents future insertions from going into the list of pending index entries,
but does not in itself flush previous entries.

You might want to VACUUM the table or call gin_clean_pending_list function afterward to ensure the pending list is emptied.

2. gin_pending_list_limit

Custom gin_pending_list_limit parameter.
This value is specified in kilobytes.

The purpose of gin_pending_list_limit is to delay the merging action, because a record may involve too many GIN keys. If updated in real time, a large number of GIN indexes need to be written, influencing the performance.

In this example, gin_pending_list_limit is set to 2 MB, and the TPS is relatively stable. If a larger value is set for gin_pending_list_limit, significant fluctuation will occur in the event of insufficient CPU resources.

You can set a proper gin_pending_list_limit value based on specific test requirements.

5. If PostgreSQL is used entirely as an index database and data loss can be tolerated, you can set the fsync switch to off. (fsync on checkpoints has big impact on the I/O performance. Because a SATA disk is used in this example, significant performance fluctuation will occur if fsync is on.)

postgresql.auto.conf      

fsync = off
zero_damaged_pages = on

The risk of data loss is much lower if databases are designed to have high availability. (However, after a server crash, rebuilding a standby database is not easy due to the massive amounts of data.)

We recommend that you use more database instances and make the size of each instance controllable (for example, smaller than 2 TB), so that the time for rebuilding a standby database is relatively manageable.

6. In order to achieve better response time (RT), we recommend that you store detail data and indexes separately, because writing detail data requires short RT while indexes can tolerate some latency. In addition, the reliability requirements for detail data and indexes are different.

Reference:https://www.alibabacloud.com/blog/behavior-and-audit-log-modeling-postgresql-best-practice-1_594865?spm=a2c41.12952233.0.0

--

--

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com