Real-Time Marketing Data Analysis Based on Hologres + Flink


During the 2020 Double 11 Global Shopping Festival, Alibaba’s cloud-native real-time data warehouse was first implemented in core data scenarios and reached a new record for the big data platform. This data warehouse is built based on Hologres and Realtime Compute for Apache Flink, and has set a new record for the big data platform. This article focuses on the best practices of Hologres in Taobao’s marketing analysis scenarios. The article also discusses the technical challenges behind the first-time implementation of Apache Flink + Hologres stream-batch unification on the marketing analysis screen during Double 11.

1. Real-Time Marketing Data Analysis

Big promotions are crucial for business operations and user number growth for Taobao business operations. Marketing analysis services are the core data services used for decision-making and operation instructions during big promotions. They cover the analysis works of the whole process before, during, and after the promotions. Different requirements for data timeliness and flexibility must be met in different phases.

The traditional marketing analysis is based on conventional real-time offline data systems and FW product architectures. Many problems have been exposed in large and small activities in the past, and the core problems can be classified into three types:

Therefore, it is increasingly important to respond to the changing business requirements quickly and to process data efficiently during activities. The upgraded next-generation architecture for marketing analysis should have the following advantages:

  1. Unified models (unified real-time and offline logics) and unified interfaces (unified data storage and retrieval interfaces) of real-time and offline data warehouses should be achieved for stream-batch unification.
  2. A more powerful data warehouse is required to support highly-concurrent writes, and query large amounts of data with quick response time.
  3. The existing product construction logic needs to be simplified, and the product implementation complexity needs to be reduced.

Based on the background information listed above, it is necessary to reconstruct the current architecture and find alternatives to solve the business pain points. After a lot of attempts, the Apache Flink + Hologres + FBI (a visual analysis tool of Alibaba) solution was chosen to re-architect the architecture for the Tmall marketing analysis service.

2. Stream-Batch Unification Solution

After deeply analyzing business requirements for data and exploring data models multi-dimensionally, the overall technical framework for marketing analysis reconstruction is finally determined, as shown in the figure below. The core points of the framework include:

The following section will introduce the core technologies of the technical solution: stream-batch unification, Hologres, and FBI.

1) Stream-Batch Unified Technical Framework

The following figure shows the traditional architecture of data warehouses. The core issues of the traditional data warehouse architecture are listed below:

The following figure shows the stream-batch unified architecture of data warehouses. The upgraded architecture has the following core features:

The stream-batch unification data service based on Apache Flink has made a breakthrough in marketing analysis scenarios and has withstood severe stability, performance, and efficiency tests. During 2020 Double 11, the peak value of records processed by Flink reached 4 billion per second, and the data volume reached 7 TB per second. High stability was achieved in the entire Flink streaming and batch processing tasks during that time. The entire procedure was free of problems in terms of capacity, single points of servers, and network bandwidth.

2) Implementation of Hologres Stream-Batch Unification

The entire data layer is unified through the stream-batch unification data architecture. Another product is also needed to unify the entire storage layer. This product needs to support high-concurrency writes, real-time queries, and OLAP analysis.

In the architecture of the earlier versions, each page module involved the data query of one or more databases, such as MySQL, Apache HBase, and AnalyticDB 3.0 (the original HybridDB). Most real-time data was stored in HBase because HBase supported high-concurrency writing and high-performance point query. With the easy management and simple query of MySQL tables, dimension table data and offline data is usually stored in MySQL. In addition, the data used by some product modules have a small volume and multiple dimensions, such as marketing method data. AnalyticDB is selected as the database for OLAP multi-dimensional analysis. As a result, we had two pain points: the separation of real-time data from offline data and the disordered management of multiple databases and instances.

The new marketing analysis service is constructed to achieve unified storage, reduce O&M costs, and improve R&D efficiency. It also aims to realize high performance, high stability, and low cost.

After comparing multiple products, Hologres was selected for the marketing analysis service. Hologres is an all-in-one real-time data warehouse and is compatible with the PostgreSQL 11 protocol. It seamlessly connects to the big data ecosystem and supports the high-concurrency and low-latency analysis and has the capability to process PB-level data efficiently. It can easily and economically use existing BI tools to perform multi-dimensional analysis and business exploration of data. Hologres is extremely advantageous in complex business scenarios.

Based on the in-depth analysis of the modules in the marketing analysis service and the business-side requirements for data timeliness, specific real-time procedure solutions are made for these modules.

Based on the point query and OLAP analysis required for marketing analysis, dt-camp and dt-camp-olap libraries are created for Tmall marketing analysis. Some historical data during the marketing activity needs to be stored in the dt_camp library for a long time to make comparisons of different activities. The overall data volume is nearly 40 TB. What is stored in the OLAP library is the detailed data of the marketing methods, and the overall data volume is nearly 100 TB. As marketing methods require very high accuracy of the overall data, the inaccurate query method is not used. This puts forward higher requirements on the overall query performance of the database.

To improve the overall performance of Hologres, the following optimization strategies have been implemented for its database based on marketing analysis:

  1. Set the Distribution Key: In the case of the count(distinct user_id) statement, the user_id is set as the distribution key. This way, the count distinct is performed parallelly in each shard of Hologres. This prevents large amounts of data from being shuffled and significantly improves query performance.
  2. Minimize the Count Distinct Times: The SQL is converted through the multi-layer GROUP BY operation to reduce the cost of COUNT DISTINCT.
  3. Shard Prunning: In some scenarios, some of the primary keys of a table will be targeted during the query. If the key combination in these scenarios is set as the distribution key, shards hit by this query can be determined when the query is processed. Thus, the number of RPC requests is reduced, which is crucial for scenarios with high QPS.
  4. Generate the Optimal Plan: Marketing analysis includes point query or range query based on summarized data, OLAP query based on raw data, and TopN query after single-table aggregation. For different query types, Hologres can generate optimal execution plan based on the collected statistics to ensure the QPS and latency of queries.
  5. Optimize the Writing: The writing operations of marketing analysis are based on the UPDATE operations on column storage tables. In Hologres, the UPDATE operations find the corresponding uniqueid based on the specified primary keys (PK). Then, they find the corresponding record mark for deletion based on the uniqueid and finally insert a new record. In this case, if an incremental segment key can be set, the file can be located quickly based on the segment key during querying. This improves the record locating speed based on the PK, as well as writing performance. The peak writing traffic in the system stress testing of marketing analysis can reach 8 million records per second based on UPDATE operations.
  6. Compact Small Files: Some tables are written infrequently because the key updates during a fixed period of time is relatively constant. In such a case, the memory table would flush small files sometime. However, the default compaction strategy of Hologres does not always work on these files, resulting in a relatively large number of small files. Through in-depth optimization of the compaction parameters and increasing the compaction frequency, small file number is reduced, improving the query performance significantly.

Hologres performed stably during 2020 Double 11. The peak writing speed of point query in Hologres reached several hundreds of thousands of records per second, while millions of records were served per second. For OLAP, the peak writing speed reached four million records per second, and five million records were served per second. Meanwhile, the execution time of more than 99.7% of single point queries and OLAP queries was less than one millisecond. Therefore, the overall system performance was stable throughout Double 11 and supported fast point queries and OLAP analyses simultaneously.

3) FBI Analysis Big Screen

FBI is the preferred data visualization platform in the Alibaba ecosystem. It can quickly create reports for data analysis and support fast access and expansion of multiple data sets. It also supports the construction of various analytical data products through the product construction feature.

During the core process of product construction in FBI, four core functions can be used to reduce the construction cost:

  1. The “real-time hour-minute model” with online-offline unification enables accurate comparison and trend analysis of real-time data.

For the underlying data after stream-batch unified processing, users demand the flexible analysis of real-time data, real-time comparison, and hourly comparison. To meet users’ demands, FBI abstracts a standard data model featuring online-offline unification. This model allows accurate comparison of real-time data. For trend analysis, minute-based trends are automatically routed to minute-based tables, and hour-based trends are directly routed to hour-based tables.

  1. FBI’s original FAX functions are used to define and output various complex metrics in an extremely simplified way.

Complex metrics in FBI, such as channel proportion, category proportion, year-on-year contribution, and cumulative volume of activities, were all defined with complex SQL statements in the previous version. As a result, the SQL statements were too long, and the stability and maintainability of the product were also reduced. To solve this problem, FBI built a set of analytical DSL that is easy to learn and understand, called FAX functions. FAX functions include more than 20 analysis functions, such as year-on-year balance, contribution rate, and activity accumulation. Simple statements can be used to define a variety of complex metrics used in marketing analysis.

  1. The configurable analysis capabilities and plug-in exclusive logic can reduce page construction time.

Product page construction is a crucial process. FBI’s method to simplify user configuration is listed below:

  1. General Configurable Analysis Capabilities: Most commonly used analysis capabilities, such as cross tables, activity comparison, and date variable parameter passing, are upgraded to simple configuration items. Then, the year-on-year comparison and year-on-year balance analyses can be completed.
  2. Plug-In Exclusive Logic: Some customized capabilities, such as activity parameters, result displaying and hiding, and result sorting, are used for blocks. These capabilities can be transformed into data plug-ins.
  3. FBI’s high security system has been developed and tested. The release control, monitoring and alert, and change notification features have been upgraded with support for 1–5–10.

3. Test-Side Ensuring

To further ensure the quality of the marketing analysis service, the test side strictly compares and verifies data at the DWD layer, the DWS layer, and the product layer. A full range of monitoring is also carried out on the core data of the big promotion.

During 2020 Double 11, the test and inspection feature improved the ability to actively identify data problems and core issues in a timely manner. The quality and stability of the whole data product improved significantly.

4. Business Feedback and Value

During 2020 Double 11, the stream-batch unified marketing analysis service based on Apache Flink + Hologres supported an average of hundreds of PV access per capita for thousands of Tmall listings. The service also achieved zero P1/P2 failure. At the same time, several advantages of the service were shown during the activity compared with previous years:

5. Future Prospects

Even though it was tested by 2020 Double 11, it is still necessary to continuously improve the technologies to deal with more complex business scenarios:

  1. Dataphin’s stream-batch unification products need further improvement to reduce manual intervention costs and further ensure data quality.
  2. The Hologres resource separation, and isolation of read and write resources will be improved to better ensure query SLA. Hologres and MaxCompute interconnection can support metadata interconnection to have high guarantee on the product metadata. Dynamic scaling can be applied to flexibly respond to peak and daily business needs.
  3. FBI tools can improve the version management function of products. The same page supports multi-person editing and enables more efficient product construction.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.