Principles and Optimization of 5 PostgreSQL Indexes (btree,hash,gin,gist,and brin)

Image for post
Image for post

Background

Note: This article contains links to Chinese blog articles.
Precision marketing is a hot topic in the advertising industry. One of my previous articles describes how to use arrays and GIN indexes on PostgreSQL to tag people in real-time. (Source to Chinese article: Marketing to trillions of people in milliseconds — database design for real-time recommendations systems with trillions of user tags)

select * from user_tags where lab1 ? xxx and lab2 ? xxx or lab4 ? xxx;  

select xx, count(*) from user_tags where lab1 ? xxx and lab2 ? xxx or lab4 ? xxx group by xxx;

Modeling and testing

Create a TAG table

postgres=# create table tbl_label(uid int primary key, c1 int, c2 text, c3 numeric, c4 timestamp, c5 interval, c6 int);  
CREATE TABLE
Time: 31.145 ms
postgres=# insert into tbl_label select id,   
random()*10000, md5(random()::text),
10000*random(), clock_timestamp(),
(random()*1000::int||' hour')::interval,
random()*99999
from generate_series(1,10000000) t(id);
INSERT 0 10000000
postgres=# select * from tbl_label limit 10;  
uid | c1 | c2 | c3 | c4 | c5 | c6
-----+------+----------------------------------+------------------+----------------------------+------------------+-------
1 | 1692 | 505662aa4a6b33e1775cea660063ba58 | 9761.26249413937 | 2017-06-12 18:38:57.515097 | 322:06:55.266882 | 67699
2 | 8611 | a60d564b7f4d58029dfd5e16f0457305 | 1003.07232700288 | 2017-06-12 18:38:57.515282 | 780:59:39.081975 | 89283
3 | 290 | f226358e08372d4b79c8ecdd27172244 | 8240.20517989993 | 2017-06-12 18:38:57.515296 | 261:29:59.658099 | 87975
4 | 7829 | 32bc5d97731ddaf2c1630218e43d1e85 | 9061.87939457595 | 2017-06-12 18:38:57.515303 | 760:47:18.267513 | 76409
5 | 7735 | 3813b4bcdaadc21a55da143f6aceeac9 | 6651.74870751798 | 2017-06-12 18:38:57.515309 | 512:45:50.116217 | 11252
6 | 9945 | ff72917169cdea9225e429e438f22586 | 2114.50539063662 | 2017-06-12 18:38:57.515316 | 63:30:34.15014 | 33288
7 | 9144 | 7cf4067f22c5edbb1fc4e08ecee7242c | 5662.74457611144 | 2017-06-12 18:38:57.515322 | 890:30:28.360096 | 55484
8 | 2433 | 8ac9732bec2b1c175483c16e82467653 | 9184.17258188128 | 2017-06-12 18:38:57.515328 | 343:34:40.88581 | 53265
9 | 8113 | 2dd724e82dc7c2a15dfda45f6a41cd53 | 5094.92502082139 | 2017-06-12 18:38:57.515334 | 635:16:39.096908 | 63410
10 | 3893 | b8abdb00228f09efb04c1e2a8a022c22 | 6397.02362008393 | 2017-06-12 18:38:57.51534 | 295:26:24.752753 | 17894
(10 rows)
postgres=# analyze tbl_label ;  
ANALYZE
Description of n_distinct  

-1 indicates that each row in the column is unique.

>=1 indicates the number of unique values in the column

<1 indicates the number/total number of unique values in the column

Description of correlation
It indicates the linear correlation between this column and the data stack storage, where 1 indicates perfect positive correlation. As the value gets closer to 0, the data distribution is more discrete. <0 indicates an inverse correlation.

Uid is auto-increment, c4 increases by time, so the correlation is 1, a perfect positive correlation.

postgres=# select tablename,attname,n_distinct,correlation from pg_stats where tablename='tbl_label';
tablename | attname | n_distinct | correlation
-----------+---------+------------+-------------
tbl_label | uid | -1 | 1
tbl_label | c1 | 10018 | 0.00431651
tbl_label | c2 | -0.957505 | -0.00796595
tbl_label | c3 | -1 | 0.00308372
tbl_label | c4 | -1 | 1
tbl_label | c5 | -1 | 0.000382809
tbl_label | c6 | 100688 | 0.00156045
(7 rows)
postgres=# create extension btree_gin;  
CREATE EXTENSION
Create compound gin indexes for c1 and c6.
postgres=# set maintenance_work_mem ='32GB';
SET
Time: 0.168 ms
postgres=# create index idx_tbl_label_1 on tbl_label using gin(c1,c6);
CREATE INDEX
Time: 1504.542 ms (00:01.505)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100;  
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.tbl_label (cost=125.76..8759.97 rows=10074 width=80) (actual time=40.856..50.480 rows=9922 loops=1)
Output: uid, c1, c2, c3, c4, c5, c6
Recheck Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100))
Heap Blocks: exact=7222
Buffers: shared hit=7982
-> Bitmap Index Scan on idx_tbl_label_1 (cost=0.00..123.24 rows=10074 width=0) (actual time=39.773..39.773 rows=9922 loops=1)
Index Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100))
Buffers: shared hit=760
Planning time: 0.105 ms
Execution time: 51.043 ms
(10 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100 or c6=100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.tbl_label (cost=134.36..8799.70 rows=10085 width=80) (actual time=41.133..50.187 rows=9932 loops=1)
Output: uid, c1, c2, c3, c4, c5, c6
Recheck Cond: (((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100)) OR (tbl_label.c6 = 100))
Heap Blocks: exact=7228
Buffers: shared hit=7992
-> BitmapOr (cost=134.36..134.36 rows=10085 width=0) (actual time=40.045..40.045 rows=0 loops=1)
Buffers: shared hit=764
-> Bitmap Index Scan on idx_tbl_label_1 (cost=0.00..123.24 rows=10074 width=0) (actual time=40.031..40.031 rows=9922 loops=1)
Index Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100))
Buffers: shared hit=760
-> Bitmap Index Scan on idx_tbl_label_1 (cost=0.00..6.08 rows=11 width=0) (actual time=0.012..0.012 rows=10 loops=1)
Index Cond: (tbl_label.c6 = 100)
Buffers: shared hit=4
Planning time: 0.125 ms
Execution time: 50.758 ms
(15 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100 and c6=100;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.tbl_label (cost=22.50..24.02 rows=1 width=80) (actual time=36.193..36.193 rows=0 loops=1)
Output: uid, c1, c2, c3, c4, c5, c6
Recheck Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100) AND (tbl_label.c6 = 100))
Buffers: shared hit=763
-> Bitmap Index Scan on idx_tbl_label_1 (cost=0.00..22.50 rows=1 width=0) (actual time=36.190..36.190 rows=0 loops=1)
Index Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100) AND (tbl_label.c6 = 100))
Buffers: shared hit=763
Planning time: 0.115 ms
Execution time: 36.226 ms
(9 rows)
postgres=# create index idx_tbl_label2 on tbl_label using btree(c2);  
CREATE INDEX
Time: 1388.756 ms (00:01.389)

postgres=# create index idx_tbl_label3 on tbl_label using btree(c3);
CREATE INDEX
Time: 1028.865 ms (00:01.029)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100 and c6=100 and c2='abc';  
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl_label2 on public.tbl_label (cost=0.42..3.45 rows=1 width=80) (actual time=0.032..0.032 rows=0 loops=1)
Output: uid, c1, c2, c3, c4, c5, c6
Index Cond: (tbl_label.c2 = 'abc'::text)
Filter: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100) AND (tbl_label.c6 = 100))
Buffers: shared read=3
Planning time: 0.248 ms
Execution time: 0.056 ms
(7 rows)

Explanations

I used GIN multi-column compound indexes in the above example, but there is actually another way around the issue. We can convert multiple columns into an array, and create array indexes (PostgreSQL expression indexes)

postgres=# create or replace function to_array(VARIADIC anyarray) returns anyarray as $$  
select $1;
$$ language sql strict immutable;
CREATE FUNCTION
Example
postgres=# select to_array('a'::text,'b','c');
to_array
----------
{a,b,c}
(1 row)

postgres=# select to_array(now(),clock_timestamp());
to_array
-------------------------------------------------------------------
{"2017-06-12 17:50:47.992274+08","2017-06-12 17:50:47.992489+08"}
(1 row)

postgres=# select to_array(1,2,3);
to_array
----------
{1,2,3}
(1 row)
create index idx_tbl_label_combin on tbl_label using gin (to_array(c1,c6));   

When we have a set of different types of columns, we can convert them into a single uniform type and then create expression indexes. To convert the columns, we may need to use the immutable function. If you don't have the immutable function, you can easily create one.

postgres=# create index idx_tbl_label_combin1 on tbl_label using gin (to_array('c1:'||c1,'c6:'||c6));
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where to_array(c1,c6) && array[1,2];  
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.tbl_label (cost=840.56..86397.30 rows=99750 width=80) (actual time=0.777..4.030 rows=2254 loops=1)
Output: uid, c1, c2, c3, c4, c5, c6
Recheck Cond: (ARRAY[tbl_label.c1, tbl_label.c6] && '{1,2}'::integer[])
Heap Blocks: exact=2242
Buffers: shared hit=2251
-> Bitmap Index Scan on idx_tbl_label_combin (cost=0.00..815.62 rows=99750 width=0) (actual time=0.465..0.465 rows=2254 loops=1)
Index Cond: (ARRAY[tbl_label.c1, tbl_label.c6] && '{1,2}'::integer[])
Buffers: shared hit=9
Planning time: 0.361 ms
Execution time: 4.176 ms
(10 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where to_array('c1:'||c1,'c6:'||c6) && array['c1:1'];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.tbl_label (cost=422.00..54015.43 rows=50000 width=80) (actual time=0.331..1.888 rows=1021 loops=1)
Output: uid, c1, c2, c3, c4, c5, c6
Recheck Cond: (ARRAY[('c1:'::text || (tbl_label.c1)::text), ('c6:'::text || (tbl_label.c6)::text)] && '{c1:1}'::text[])
Heap Blocks: exact=1019
Buffers: shared hit=1024
-> Bitmap Index Scan on idx_tbl_label_combin1 (cost=0.00..409.50 rows=50000 width=0) (actual time=0.195..0.195 rows=1021 loops=1)
Index Cond: (ARRAY[('c1:'::text || (tbl_label.c1)::text), ('c6:'::text || (tbl_label.c6)::text)] && '{c1:1}'::text[])
Buffers: shared hit=5
Planning time: 0.173 ms
Execution time: 1.972 ms
(10 rows)

Summary

1.When to choose btree?

What are the queries supported by each index type?

1.btree

How to optimize index efficiency

The above explains how to select an index type, but does not mention how to optimize your indexes. In reality, the efficiency of an index is heavily reliant on data distribution.

References:

Introduction to the Alibaba Cloud RDS for PostgreSQL varbitx extension and real-time image applications

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store