Use Case Analysis — Real-time Precision Marketing and Group Selection

Alibaba Cloud
10 min readJul 7, 2020

Use Case

This article demonstrates a case of quickly identifying target groups based on their characteristics. Such cases are common in almost all industries, including the Internet, new retail, education, gaming, healthcare, and social networking. Let’s take a look at some examples:

  • In the e-commerce industry, before launching operational activities, merchants select a group of target users to push advertisements or match activity conditions based on the characteristics of target groups of the activities.
  • In the gaming industry, operations are often determined based on some characteristics of players, and bonus packages are distributed in a targeted manner to spur player activity.
  • In the education industry, various targeted exercises are pushed based on the characteristics of students to improve the students’ skills in areas of weakness.
  • In businesses such as online searching, portals, and video websites, group push content varies according to users’ hot spots and recent behavior.

Pain Points

Business Features

  • The data volume is large and the computing workload is high.
  • There are many user tags and fields, which consume massive storage space.
  • There are many fields, the number of which may exceed the threshold in the database.

Generally, a database supports a maximum of more than 1,000 fields.

  • To use arrays for replacing multi-field storage tags, the database must support inverted index entries. However, not all databases support inverted indexes.
  • The use of arrays instead of multi-field storage tags, along with inverted indexes, sharply increases the required storage space.
  • The combinations of selection conditions are diverse, and therefore the optimization is not made on fixed index entries. In this case, each field corresponds to an index entry, causing a sharp increase in the required storage space.
  • Performance requirements are demanding, and real-time marketing and second-level responsiveness are required.
  • The requirement for prompt data update is high, and user profiling needs to be updated in quasi-real-time. If a quasi-real-time update is not possible, user selection can be inaccurate.

For example, if a user browsed mobile phones yesterday and placed an order later at night; but profiling data was not updated, then this user will be selected for mobile phone sellers, while actually the user is no longer in the target group.

Business Pain Points

Common database products, such as MySQL, have limited resources and cannot meet the requirement of real-time target group selection.

Technical Solution

Solution 1

Table Structure: Consider the following table structure.

KEY: 用户ID
标签1:
标签2:
...
标签N:

Index: Consider the following index.

每个标签字段一个索引

Search Method: Consider the following search method.

and , or , not 组合
where 标签a and 标签b and ...

Defects: Note the following defects of this solution method:

  • There are many user tags and fields, which consume massive storage space.
  • There are many fields, the number of which may exceed the threshold in the database.

Generally, a database supports a maximum of more than 1,000 fields.

  • The combinations of selection conditions are diverse, and therefore, the optimization is not made on fixed index entries. In this case, each field corresponds to an index entry, causing a sharp increase in the required storage space.
  • A large amount of data needs to be updated when a new multi-user group (tag) is added.
  • Poor query performance
  • Solution 1 can also use the many-to-many structure with one record for each tag, solving the problem of exceeding the field quantity threshold. Alternatively, the limit on the number of fields can be solved by splitting tables, but cross-table JOIN is probably required during queries.

Solution 2

Table Structure: Consider the following table structure.

KEY:用户ID
VALUES:标签数组

Index: Consider the following index.

标签数组字段: GIN倒排索引

Search Method: Consider the following search method.

与、或、非
where VALUES @> array[标签s] -- 与
where VALUES && array[标签s] -- 或
where not VALUES @> array[标签s] -- 非

Defects: Note the following defects of this solution method:

  • To use arrays for replacing multi-field storage tags, the database must support inverted index entries. However, not all databases support inverted indexes.
  • The use of arrays instead of multi-field storage tags, along with inverted indexes, sharply increases the required storage space.
  • A large amount of data needs to be updated when a new multi-user group (tag) is added.

Solution 3

Table Structure: Consider the following table structure.

KEY:标签ID
VALUES: 用户bitmap

Index: Consider the following index.

标签ID字段: Btree索引

Search Method: Consider the following search method.

聚合bitmap: 与、或、非
and_agg(bitmaps) where KEY in (标签s) -- 与
or_agg(bitmaps) where KEY in (标签s) -- 或
except(bitmap1,bitmap2) -- 非

Defects: Note the following drawbacks of this solution method:

  • The maximum length of a bitmap is 1 GB. If the number of users exceeds the maximum length, use the offset as follows:

a) offset0_bitmap, offset1gb_bitmap, ...

  • The user ID must be a number (serial numbers are recommended.) If no numeric UID exists, create a mapping table.

Strengths: Note the following advantages of this solution method:

  • Table storage occupies a small space.
  • Indexes occupy a small amount of storage space.

a) Only one B-tree index is required, and the number of index entries is small (the number of tags is equal to that of records. Generally, the number of tags is less than one million).

  • While adding a new multi-user group (tag), only add a bitmap record for the new group rather than updating a large amount of data.
  • The query performance is outstanding.

Demo

Prerequisites include the following operations:

1) Purchase ApsaraDB RDS for PostgreSQL 12.
2) Purchase ApsaraDB RDS for MySQL 8.0.
3) Set up a whitelist.
4) Create a user.
5) Create a database.

Solution 1 Demo

MySQL does not support the array type, inverted indexes, and the bitmap feature. Therefore, only Solution 1 is supported.

1) MySQL 8.0
2) PostgreSQL 12

Step 1) Create a group table, with each entry representing a group.

create table t_tag_dict (
tag int primary key, -- 标签(人群)id
info text, -- 人群描述
crt_time timestamp -- 时间
);

Step 2) Generate 100,000 groups (tags).

insert into t_tag_dict values (1, '男', now());
insert into t_tag_dict values (2, '女', now());
insert into t_tag_dict values (3, '大于24岁', now());
-- ...
insert into t_tag_dict
select generate_series(4,100000), md5(random()::text), clock_timestamp();

Step 3) Create a user profile table (N entries for each user and each record represents a tag applied to this user).

create table t_user_tag (
uid int8, -- 用户id
tag int, -- 用户对应标签(人群)
mod_time timestamp, -- 时间
primary key (tag,uid)
);

Step 4) Tag 20 million users. Each user has 64 random tags, and half of the users are men and the other half are women. The total number of entries is 1.28 billion.

create or replace function gen_rand_tag(int,int) returns setof int as
$$
select case when random() > 0.5 then 1::int else 2::int end as tag
union all
select ceil(random()*$1)::int as tag from generate_series(1,$2);
$$ language sql strict volatile;
insert into t_user_tag
select uid, gen_rand_tag(100000,63) as tag, clock_timestamp()
from generate_series(1,20000000) as uid on conflict (uid,tag) do nothing;
-- 或使用如下方法加速导入
create sequence seq;
vi test.sql
insert into t_user_tag
select uid, gen_rand_tag(100000,63) as tag, clock_timestamp()
from nextval('seq'::regclass) as uid
on conflict(tag,uid) do nothing;
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 400000

Step 5) Query groups that contain tags 1 and 3.

1、人群数量
select count(*) from
(
select uid from t_user_tag where tag=1
intersect
select uid from t_user_tag where tag=3
) t;
-- Time: 1494.789 ms (00:01.495)
2、提取人群ID
select uid from t_user_tag where tag=1
intersect
select uid from t_user_tag where tag=3;
-- Time: 3246.184 ms (00:03.246)

Step 6) Query groups that contain tag 1, 3, 10, or 200.

1、人群数量
select count(*) from
(
select uid from t_user_tag where tag=1
union
select uid from t_user_tag where tag=3
union
select uid from t_user_tag where tag=10
union
select uid from t_user_tag where tag=200
) t;
-- Time: 3577.714 ms (00:03.578)
2、提取人群ID
select uid from t_user_tag where tag=1
union
select uid from t_user_tag where tag=3
union
select uid from t_user_tag where tag=10
union
select uid from t_user_tag where tag=200;
-- Time: 5682.458 ms (00:05.682)

Step 7) Check the used space as shown below.

public | t_user_tag         | table | postgres | 62 GB   | 
public | t_user_tag_pkey | index | postgres | t_user_tag | 61 GB |

Solution 2 Demo

1) PostgreSQL 12

Step 1) Create a group table, with each entry representing a group.

create table t_tag_dict (
tag int primary key, -- 标签(人群)id
info text, -- 人群描述
crt_time timestamp -- 时间
);

Step 2) Generate 100,000 groups (tags).

insert into t_tag_dict values (1, '男', now());
insert into t_tag_dict values (2, '女', now());
insert into t_tag_dict values (3, '大于24岁', now());
-- ...
insert into t_tag_dict
select generate_series(4,100000), md5(random()::text), clock_timestamp();

Step 3) Create a user profile table (one entry for each user and an array is used to indicate which tags a user has).

create table t_user_tags (
uid int8 primary key, -- 用户id
tags int[], -- 用户标签(人群)数组
mod_time timestamp -- 时间
);

Step 4) Create a function to generate a random tagging array.

create or replace function gen_rand_tags(int,int) returns int[] as $$
select array_agg(ceil(random()*$1)::int) from generate_series(1,$2);
$$ language sql strict;

Step 4.1) Randomly select 8 tags from the 100 thousand tags:

select gen_rand_tags(100000, 8);                   gen_rand_tags                   
---------------------------------------------------
{43494,46038,74102,25308,99129,40893,33653,29690}
(1 row)

Step 5) Tag 20 million users. Each user has 64 random tags, and half of the users are men while the other half are women.

insert into t_user_tags 
select generate_series(1,10000000),
array_append(gen_rand_tags(100000, 63),1), now();
insert into t_user_tags
select generate_series(10000001,20000000),
array_append(gen_rand_tags(100000, 63),2), now();

Step 6) Create an inverted index for the tag (group) field.

create index idx_t_user_tags_1 on t_user_tags using gin (tags);

Step 7) Query groups that contain tags 1 and 3 as shown below.

1、人群数量
select count(uid) from t_user_tags where tags @> array[1,3];
2、提取人群ID
select uid from t_user_tags where tags @> array[1,3];

Step 8) Query groups that contain tag 1, 3, 10, or 200 as shown below.

1、人群数量
select count(uid) from t_user_tags where tags && array[1,3,10,200];
2、提取人群ID
select uid from t_user_tags where tags && array[1,3,10,200];

Solution 3 Demo

1) PostgreSQL 12

ApsaraDB RDS for PostgreSQL 12 supports the bitmap feature. Instructions for using this feature are as follows.

安装插件 – create extension roaringbitmap;bitmap输出格式 – set roaringbitmap.output_format='bytea|array';bitmap取值范围 – 40亿(int4) 构造bitmap –  rb_build(int4[]) bitmap转换为数组或多条记录 - rb_to_array(rb) – rb_iterate(rb) bitmap内包含对象个数 – rb_cardinality(rb) 逻辑运算: 与、或、异或、差
SELECT roaringbitmap('{1,2,3}') | roaringbitmap('{3,4,5}');
SELECT roaringbitmap('{1,2,3}') & roaringbitmap('{3,4,5}');
SELECT roaringbitmap('{1,2,3}') # roaringbitmap('{3,4,5}');
SELECT roaringbitmap('{1,2,3}') - roaringbitmap('{3,4,5}');
聚合运算: build rb、与、或、异或
SELECT rb_build_agg(e) FROM generate_series(1,100) e;
SELECT rb_or_agg(bitmap) FROM t1;
SELECT rb_and_agg(bitmap) FROM t1;
SELECT rb_xor_agg(bitmap) FROM t1;
聚合并统计对象数(与、或、异或)
rb_or_cardinality_agg
rb_and_cardinality_agg
rb_xor_cardinality_agg
逻辑判断: 包含、相交、相等、不相等
Opperator Input Output Desc Example Result
@> roaringbitmap,roaringbitmap bool contains roaringbitmap('{1,2,3}') @> roaringbitmap('{3,4,5}') f
@> roaringbitmap,integer bool contains roaringbitmap('{1,2,3,4,5}') @> 3 t
<@ roaringbitmap,roaringbitmap bool is contained by roaringbitmap('{1,2,3}') f
<@ integer,roaringbitmap bool is contained by 3 t
&& roaringbitmap,roaringbitmap bool overlap (have elements in common) roaringbitmap('{1,2,3}') && roaringbitmap('{3,4,5}') t
= roaringbitmap,roaringbitmap bool equal roaringbitmap('{1,2,3}') = roaringbitmap('{3,4,5}') f
<> roaringbitmap,roaringbitmap bool not equal roaringbitmap('{1,2,3}') <> roaringbitmap('{3,4,5}') t

When UID exceeds int4 (4 billion), use offset for conversion. For more information, visit: https://pgxn.org/dist/pg_roaringbitmap/

Instructions for using the bitmap feature in ApsaraDB RDS for PostgreSQL to be supplemented.

Step 1) Install the plug-in.

create extension roaringbitmap;

Step 2) Create tags and the user bitmap table.

create table t_tag_users (
tagid int primary key, -- 用户标签(人群)id
uid_offset int, -- 由于userid是int8类型,roaringbitmap内部使用int4存储,需要转换一下。
userbits roaringbitmap, -- 用户id聚合的 bitmap
mod_time timestamp -- 时间
);

Step 3) Generate tags and the UID bitmap.

insert into t_tag_users 
select tagid, uid_offset, rb_build_agg(uid::int) as userbits from
(
select
unnest(tags) as tagid,
(uid / (2^31)::int8) as uid_offset,
mod(uid, (2^31)::int8) as uid
from t_user_tags
) t
group by tagid, uid_offset;

Step 4) Query groups that contain tags 1 and 3 as shown below.

1、人群数量
select sum(ub) from
(
select uid_offset,rb_and_cardinality_agg(userbits) as ub
from t_tag_users
where tagid in (1,3)
group by uid_offset
) t;
2、提取人群ID
select uid_offset,rb_and_agg(userbits) as ub
from t_tag_users
where tagid in (1,3)
group by uid_offset;

Step 5) Query groups that contain tag 1, 3, 10, or 200 as shown below.

1、人群数量
select sum(ub) from
(
select uid_offset,rb_or_cardinality_agg(userbits) as ub
from t_tag_users
where tagid in (1,3,10,200)
group by uid_offset
) t;
2、提取人群ID
select uid_offset,rb_or_agg(userbits) as ub
from t_tag_users
where tagid in (1,3,10,200)
group by uid_offset;

Solution Comparison

The following table represents the case environment:

The following table shows the performance comparison:

Value of ApsaraDB RDS for PostgreSQL Solution

1) ApsaraDB RDS for PostgreSQL supports the RoaringBitmap function, which generates, compresses, and parses bitmap data very efficiently. It also supports the most common bitmap aggregation operations such as AND, OR, NOT, and XOR. Moreover, it extracts the IDs and selectivity of bitmaps and checks whether an ID exists.
2) ApsaraDB RDS for PostgreSQL is used to meet the needs of real-time precision marketing to hundreds of millions of users. It is also used for the quick tagging of users with massive data volumes including tens of millions of tags.
3) Unlike the MySQL solution, the ApsaraDB RDS for PostgreSQL solution is much more cost-effective and efficient.

  • Used storage space is reduced by 8948%.
  • The average performance is improved by 155782.5%.
  • The maximum performance is improved by 211665%.

Note that the following versions of ApsaraDB RDS for PostgreSQL support this feature:

  • ApsaraDB RDS for PostgreSQL version 12
  • In the future, the feature will be supported by version 10 and later.

Original Source:

--

--

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com