Creating a Real-Time User Profile Recommendation System with PostgreSQL (1)

Simulation Scenario

Let’s assume that there is a 2B real-time user recommendation system, and each APPID represents one business. Service data is comprised of the APPID, USERIDs, and TAGs. (2B user ID, end user IDs, and tags). The service does not need any data exchange operations across APPIDs, so it merely provides user recommendations within the APPID.

  • Include, not include, or, and.
  • Concurrency of a few hundred, on a RT millisecond scale.

Solution 1: Wide Table Solution

The width of a table is usually restricted. Taking PostgreSQL as an example, one record is not allowed to span pages (a variable-length field is stored in TOAST storage to store a column wider than a page, and only points are stored in the page), which restricts the table width.

create table t_tags_1(appid int, userid int8, tag1 boolean, tag2 boolean, ...... tag1000 boolean);  
.....
create table t_tags_10(appid int, userid int8, tag9001 boolean, tag9002 boolean, ...... tag10000 boolean);

Solution 2: Array Solution

An array is used to replace TAG columns. This requires the database to support arrays and possess high-efficiency retrieval capability. PostgreSQL fits this requirement perfectly.

APPID, USERID, TAG[] array
  • Include all the TAGs specified in array2. Array 1 includes all the elements in array 2. (It supports index search.) array1 @> array2
  • Include one of the TAGs specified in array2. Array 1 and array 2 have overlapping elements. (It supports index search.) array1 && array2
  • Do not include any TAGs specified in array2. Array 1 and array 2 do not have any overlapping elements. (It does not support index search.) not array1 && array2
create table t_arr(appid int, userid int8, tags int2[]) with(parallel_workers=128);
create index idx_t_array_tags on t_arr using gin (tags) with (fastupdate=on, gin_pending_list_limit= 1024000000);
create index idx_t_arr_uid on t_arr(userid);
insert into t_arr select 1, 2000000000*random(),(select array_agg(10000*random()) from generate_series(1,10000));nohup pgbench -M prepared -n -r -f ./test.sql -P 1 -c 50 -j 50 -t 2000000 > ./arr.log 2>&1 &

Solution 3: Bit Solution 1

BITs are used to store TAGs, where 0 and 1 indicate whether or not there is a TAG respectively.

APPID, USERID, TAG bit stream
  • Include all the TAGs specified by bit2 (BITs corresponding to TAGs that require configuration are set to 1, and the rest are set to 0). bitand(bit1,bit2) = bit2
  • Include any of the TAGs specified by bit2 (BITs corresponding to TAGs that need to be included are set to 1, and the rest are set to 0). bitand(bit1,bit2) > 0
  • Do not include any TAG specified by bit2 (BITs corresponding to TAGs that need to be included are set to 1, and the rest are set to 0). bitand(bit1,bit2) = zerobit(10000)
create table t_bit(appid int, userid int8, tags varbit) ; create index idx_t_bit_uid on t_bit(userid);
date;for ((i=1;i<=50;i++)); do psql -c "insert into t_bit select 1, 2000000000*random(), \
(select (string_agg(mod((2*random())::int,2)::text,''))::varbit from generate_series(1,10000)) tags \
from generate_series(1,2000000)" ; done; date
create or replace function randbit(int) returns varbit as 
$$
select (string_agg(mod((2*random())::int,2)::text,''))::varbit from generate_series(1,$1);$$
language sql strict volatile;
create or replace function zerobit(int) returns varbit as
$$
select (string_agg('0',''))::varbit from generate_series(1,$1);$$
language sql strict immutable;
update t_bit set tags=randbit(10000) where userid=:id;
do language plpgsql 
$$
declare
sql text;
bit1 varbit := randbit(10000);
bit2 varbit := randbit(10000);
bit3 varbit := randbit(10000);
zbit varbit := zerobit(10000);
begin
set max_parallel_workers_per_gather =27;
sql := 'select * from t_bit where bitand(tags,'''||bit1::text||''')='''||bit1::text||''' and bitand(tags,'''||bit2::text||''')>bit''0'' and bitand(tags,'''||bit3::text||''')='''||zbit::text||'''';
raise notice '%', sql;
-- execute sql;
end;
$$
;

Solution 4: Bit Solution 2

So do we have a solution that both save resources and offer the efficiency we need?

  • Users that include the tags (the result is users with bit 1). userids (bitand) userids
  • Users that do not include the tags (the result is users with bit 0). userids (bitor) userids
  • Users that include any of the tags (the result is users with bit 1). userids (bitor) userids

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com