ClickHouse: A Business Weapon for Crowd Selection

What Is Crowd Selection?

  • E-Commerce Industry: Before activities, merchants need to select a group of target customers according to the characteristics of the target groups to push advertisements to and assess if they fit the activities.
  • Gaming Industry: The platforms present gifts to players based on player characteristics to improve player activity.
  • Education Industry: Targeted exercises need to be pushed to students based on their characteristics to help them learn.
  • Business Scenarios, such as Searches, Portals, and Video Websites: Specific content is pushed to users based on user concerns.
  1. Male: The target group for the product is male.
  2. Sports Enthusiasts: Sports enthusiasts are more likely to consume sports products.
  3. First-Tier Cities: Users in first-tier cities may consume more high-end products compared to users in second and third-tier cities.
  • Various user tags with hundreds or thousands of tag columns
  • A large amount of data and users that require a great deal of computing
  • Various Combinations of Selection Conditions: No fixed index is available for optimization and the storage space usage is high.
  • High Performance Requirements: The in-time selection results are required, and too much delay may result in inaccurate group marketing.
  • High Requirements on the Timeliness of Data Updates: User profiles need to be updated in near real-time. The expired user information will directly affect the accuracy of the selection.

Why ClickHouse?

Cost

Data Update and Governance

Ease of Use

Building a Crowd Selection System Based on ClickHouse

  1. Efficient batch data import performance
  2. Frequent processing and real-time updating
  3. DDL ability to add and delete columns
  4. Efficient query capability by specifying any column to be filtered

Update Statement in Place of Insert Statement

ALTER TABLE [db.]table UPDATE column1 = expr1 [, ...] WHERE filter_expr;
  1. Checks whether the Update operation is legal
  2. Saves the Update command to the storage file and awakens a worker thread to process merge and mutation asynchronously.
  • Locate the datapart where the data to be updated is
  • Scan the entire datapart and update the data that needs changing
  • Rewrite the data to the disk to generate a new datapart
  • Remove the expired datapart and replace it with a new one

Using the AggregatingMergeTree Table Engine

CREATE TABLE IF NOT EXISTS whatever_table ON CLUSTER default
(
user_id UInt64,
city_level SimpleAggregateFunction(anyLast, Nullable(Enum('First-tier city' = 0, 'Second-tier city' = 1, 'Third-tier city' = 2, 'Fourth-tier city' = 3))),
gender SimpleAggregateFunction(anyLast, Nullable(Enum('Female' = 0, 'Male' = 1))),
interest_sports SimpleAggregateFunction(anyLast, Nullable(Enum('No' = 0, 'Yes' = 1))),
reg_date SimpleAggregateFunction(anyLast, Datetime),
comment_like_cnt SimpleAggregateFunction(anyLast, Nullable(UInt32)),
last30d_share_cnt SimpleAggregateFunction(anyLast, Nullable(UInt32)),
user_like_consume_trend_type SimpleAggregateFunction(anyLast, Nullable(String)),
province SimpleAggregateFunction(anyLast, Nullable(String)),
last_access_version SimpleAggregateFunction(anyLast, Nullable(String)),
others SimpleAggregateFunction(anyLast,Array(String))
)ENGINE = AggregatingMergeTree() partition by toYYYYMMDD(reg_date) ORDER BY user_id;

Data Consistency Assurance

Demo

import java.sql.*;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.concurrent.TimeoutException;
public class Main {
private static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";
private static final SimpleDateFormat SIMPLE_DATE_FORMAT = new SimpleDateFormat(DATE_FORMAT);
public static void main(String[] args) throws ClassNotFoundException, SQLException, InterruptedException, ParseException {
String url = "your url";
String username = "your username";
String password = "your password";
Class.forName("ru.yandex.clickhouse.ClickHouseDriver");
String connectionStr = "jdbc:clickhouse://" + url + ":8123";
try {
Connection connection = DriverManager.getConnection(connectionStr, username, password);
Statement stmt = connection.createStatement();
// Create a local table
String createLocalTableDDL = "CREATE TABLE IF NOT EXISTS whatever_table ON CLUSTER default " +
"(user_id UInt64, " +
"city_level SimpleAggregateFunction(anyLast, Nullable(Enum('First-tier city' = 0, 'Second-tier city' = 1, 'Third-tier city' = 2, 'Fourth-tier city' = 3))), " +
"gender SimpleAggregateFunction(anyLast, Nullable(Enum('Female' = 0, 'Male' = 1)))," +
"interest_sports SimpleAggregateFunction(anyLast, Nullable(Enum('No' = 0, 'Yes' = 1)))," +
"reg_date SimpleAggregateFunction(anyLast, Datetime)) " +
"comment_like_cnt SimpleAggregateFunction(anyLast, Nullable(UInt32)),\n" +
"last30d_share_cnt SimpleAggregateFunction(anyLast, Nullable(UInt32)),\n" +
"user_like_consume_trend_type SimpleAggregateFunction(anyLast, Nullable(String)),\n" +
"province SimpleAggregateFunction(anyLast, Nullable(String)),\n" +
"last_access_version SimpleAggregateFunction(anyLast, Nullable(String)),\n" +
"others SimpleAggregateFunction(anyLast, Array(String)),\n" +
"ENGINE = AggregatingMergeTree() PARTITION by toYYYYMM(reg_date) ORDER BY user_id;";
stmt.execute(createLocalTableDDL);
System.out.println("create local table done.");
// Create a distributed table
String createDistributedTableDDL = "CREATE TABLE IF NOT EXISTS whatever_table_dist ON cluster default " +
"AS default.whatever_table " +
"ENGINE = Distributed(default, default, whatever_table, intHash64(user_id));";
stmt.execute(createDistributedTableDDL);
System.out.println("create distributed table done");
// Insert mock data
String insertSQL = "INSERT INTO whatever_table(\n" +
"\tuser_id,\n" +
"\tcity_level,\n" +
"\tgender,\n" +
"\tinterest_sports,\n" +
"\treg_date,\n" +
"\tcomment_like_cnt,\n" +
"\tlast30d_share_cnt,\n" +
"\tuser_like_consume_trend_type,\n" +
"\tprovince,\n" +
"\tlast_access_version,\n" +
"\tothers\n" +
"\t)SELECT\n" +
" number as user_id,\n" +
" toUInt32(rand(11)%4) as city_level,\n" +
" toUInt32(rand(30)%2) as gender,\n" +
" toUInt32(rand(28)%2) as interest_sports,\n" +
" (toDateTime('2020-01-01 00:00:00') + rand(1)%(3600*24*30*4)) as reg_date,\n" +
" toUInt32(rand(15)%10) as comment_like_cnt,\n" +
" toUInt32(rand(16)%10) as last30d_share_cnt,\n" +
"randomPrintableASCII(64) as user_like_consume_trend_type,\n" +
"randomPrintableASCII(64) as province,\n" +
"randomPrintableASCII(64) as last_access_version,\n" +
"[randomPrintableASCII(64)] as others\n" +
" FROM numbers(100000);\n";
stmt.execute(insertSQL);
System.out.println("Mock data and insert done.");
System.out.println("Select count(user_id)...");
ResultSet rs = stmt.executeQuery("select count(user_id) from whatever_table_dist");
while (rs.next()) {
int count = rs.getInt(1);
System.out.println("user_id count: " + count);
}
// Merge data
String optimizeSQL = "OPTIMIZE table whatever_table final;";
// If the data merge time is too long, execute the optimize command in parallel at the partition level.
String optimizeByPartitionSQL = "OPTIMIZE table whatever_table PARTITION 202001 final;";
try {
stmt.execute(optimizeByPartitionSQL);
}catch (SQLTimeoutException e){
// View merge progress
// String checkMergeSQL = "select * from system.merges where database = 'default' and table = 'whatever_table ';";
Thread.sleep(60*1000);
}
// Crowd Selection(city_level='First-tier city',gender='Male',interest_sports='Yes', reg_date<='2020-01-31 23:59:59')
String selectSQL = "SELECT user_id from whatever_table_dist where city_level=0 and gender=1 and interest_sports=1 and reg_date <= NOW();";
rs = stmt.executeQuery(selectSQL);
while (rs.next()) {
int user_id = rs.getInt(1);
System.out.println("Got suitable user: " + user_id);
}
} catch (Exception e) {
e.printStackTrace();
}
}
}

PostScript

Original Source:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com