Real-time Big Data Computing with Table Store and Blink

Table Store is a multi-model NoSQL database developed by Alibaba Cloud. It provides PB-level structured data storage, throughput of 10 million TPS, and millisecond-level latency. In real-time computing, its strong write capability and multi-model storage allow it to serve not only as a result table for computation, but also as a source table.

Tunnel Service is the Table Store feature for consuming full and incremental data through one integrated interface. It provides three types of distributed, real-time data consumption tunnels: Stream (incremental data), BaseData (full data), and BaseAndStream (full data followed by incremental data). In real-time computing, you can create a tunnel for a data table and consume both the historical data and the new data in the table in stream computing mode.
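The difference between the three tunnel types can be sketched in a few lines of plain Python. This is only a toy model: ToyTable and its methods are hypothetical and are not part of the Table Store SDK, and the sketch assumes that "historical data" means the rows that already exist when the tunnel is created.

```python
from enum import Enum


class TunnelType(Enum):
    BASE_DATA = "BaseData"              # existing (full) data only
    STREAM = "Stream"                   # incremental changes only
    BASE_AND_STREAM = "BaseAndStream"   # full data first, then incremental


class ToyTable:
    """Hypothetical in-memory stand-in for a data table; not the Table Store SDK."""

    def __init__(self, existing_rows):
        self.base = list(existing_rows)  # rows present when the tunnel is created
        self.changes = []                # rows written afterwards

    def put(self, row):
        self.changes.append(row)

    def consume(self, tunnel_type):
        """Return the rows a consumer of the given tunnel type would see, in order."""
        rows = []
        if tunnel_type in (TunnelType.BASE_DATA, TunnelType.BASE_AND_STREAM):
            rows.extend(self.base)
        if tunnel_type in (TunnelType.STREAM, TunnelType.BASE_AND_STREAM):
            rows.extend(self.changes)
        return rows


table = ToyTable(existing_rows=["order-1", "order-2"])
table.put("order-3")  # written after the tunnel exists

stream_rows = table.consume(TunnelType.STREAM)
base_rows = table.consume(TunnelType.BASE_DATA)
both_rows = table.consume(TunnelType.BASE_AND_STREAM)
print(stream_rows, base_rows, both_rows)
```

In this model a daily report job that only cares about new orders would use a Stream tunnel, while a job that must also backfill existing orders would use BaseAndStream.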

With the powerful write capability of the Table Store engine and the complete streaming consumption capability of Tunnel Service, you can easily integrate data storage and real-time processing.

Blink: Stream-batch Integrated Data Processing Engine

From the perspective of the user's technical architecture, combining Table Store and Blink achieves the following:

  • In terms of storage, data needs to be written only once, to Table Store. The data is immediately visible to the service, and subsequent stream computing can consume it natively, so the service does not need to dual-write.
  • In terms of computation, Blink, the stream-batch integrated data processing engine, unifies the stream computing and batch computing architectures, so that a single set of code supports both stream and batch scenarios.
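The idea of one set of code serving both modes can be illustrated with a small Python sketch. It is not Blink code: daily_totals and live_feed are hypothetical names, and the order records are made-up sample data. The same aggregation function is applied once to a bounded list (batch) and once to a generator that stands in for an unbounded source (stream).

```python
from collections import defaultdict


def daily_totals(orders):
    """Aggregate (pay_day, price) pairs; works on any iterable, bounded or not."""
    totals = defaultdict(lambda: {"total_price": 0, "total_order": 0})
    for pay_day, price in orders:
        day = totals[pay_day]
        day["total_price"] += price
        day["total_order"] += 1
        # Emit the updated summary after every record, like a continuous query.
        yield pay_day, dict(day)


history = [("2019-05-01", 30), ("2019-05-01", 20), ("2019-05-02", 15)]

# Batch mode: run over the bounded history and keep only the final value per day.
batch_result = dict(daily_totals(history))


def live_feed():
    """Hypothetical stand-in for a stream source."""
    yield ("2019-05-03", 40)
    yield ("2019-05-03", 10)


# Stream mode: same function, with every intermediate update observed.
stream_updates = list(daily_totals(live_feed()))
print(batch_result)
print(stream_updates)
```

The only difference between the two modes is how the caller reads the output: batch keeps the last value per key, stream reacts to every update.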

This article introduces the best architecture practice of real-time computing based on Table Store and Blink, as well as the data analysis job based on Table Store and Blink.

Better Real-Time Computing Architecture with Table Store and Blink

From a technical point of view, the customer needs two capabilities: 1. Real-time processing of customer data, that is, continuous aggregation of new order information for big-screen and daily report display; 2. Offline analysis of historical data, to support situational awareness and decision-making recommendations.

The classic solution is based on the Lambda big data architecture. As shown in Figure 1, the user data must enter both the message queue system (New Data Stream, such as Kafka) as the input source for real-time computing tasks, and the database system (All Data, such as HBase) to support the batch processing system. Finally, the results of both are written to the database system (MERGED VIEW) and displayed to the user.

Figure 1. Lambda Big Data Architecture

The disadvantage of this system is that it is large and requires maintaining multiple distributed subsystems. Data must be written to the message queue and also imported into the database, so either the consistency of the dual write must be handled or a synchronization scheme between the two systems must be maintained. In terms of computing, maintaining two computing engines and developing two sets of data analysis code is technically difficult and labor-intensive.
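The dual-write problem can be made concrete with a small sketch. FlakyQueue and dual_write are hypothetical names used only for illustration; the point is that when one of the two writes fails and nothing reconciles them, the batch side and the streaming side silently diverge.

```python
class FlakyQueue:
    """Hypothetical message queue whose send fails on chosen calls."""

    def __init__(self, fail_on):
        self.fail_on = set(fail_on)
        self.calls = 0
        self.messages = []

    def send(self, record):
        self.calls += 1
        if self.calls in self.fail_on:
            raise ConnectionError("queue unavailable")
        self.messages.append(record)


def dual_write(record, database, queue):
    """Naive dual write: the database write succeeds even if the queue write fails."""
    database.append(record)
    try:
        queue.send(record)
    except ConnectionError:
        pass  # without retry or reconciliation, the record is lost for streaming


database = []
queue = FlakyQueue(fail_on=[2])  # the second send fails
for order_id in ["o1", "o2", "o3"]:
    dual_write(order_id, database, queue)

# Records the batch side has but the real-time side never saw.
missing_from_stream = sorted(set(database) - set(queue.messages))
print(missing_from_stream)
```

Writing to a single store that also exposes its changes as a stream removes this class of inconsistency, because there is only one write to succeed or fail.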

With the strong write and real-time data consumption capabilities of Table Store, and the high-performance SQL processing and stream-batch integration of Blink, the classic Lambda architecture can be simplified as shown in Figure 2, a real-time computing architecture based on Table Store and Blink:

Figure 2. Real-time computing architecture based on Table Store and Blink

This architecture depends on far fewer systems, so both labor and resource costs decline significantly. Its basic process is as follows:

  • The user writes the online order data or system-captured data to the Table Store source table, and creates a Tunnel Service data tunnel on the source table.
  • For real-time computing tasks (yellow line), use Blink and the Table Store data source DDL to define the SQL source table and result table, then develop and debug the SQL job for real-time daily order aggregation.
  • For batch computing tasks (green line), define the batch source table and result table [1], and develop the SQL job for historical order analysis.
  • The front-end service displays the daily report and historical analysis results by reading the Table Store result table.

Quick Start: Real-time Daily Report Computing SQL

In the Table Store console, create a consumption order table consume_source_table (primary key: id[string]), create a Stream tunnel blink-demo-stream under Order Table -> Tunnel Management, and create a daily result summary table result_summary_day (primary key: summary_date[string]).

On the Blink development interface, create a consumption order source table and a daily result summary table, and write the aggregation SQL:

-- Consumption order source table
CREATE TABLE source_order (
    id VARCHAR,            -- Order ID
    restaurant_id VARCHAR, -- Restaurant ID
    customer_id VARCHAR,   -- Customer ID
    city VARCHAR,          -- City
    price VARCHAR,         -- Price
    pay_day VARCHAR,       -- Order date, yyyy-MM-dd
    PRIMARY KEY (id)
) WITH (
    type = 'ots',
    endPoint = 'http://blink-demo.cn-hangzhou.ots-internal.aliyuncs.com',
    instanceName = 'blink-demo',
    tableName = 'consume_source_table',
    tunnelName = 'blink-demo-stream'
);

-- Daily result summary table
CREATE TABLE result_summary_day (
    summary_date VARCHAR, -- Summary date
    total_price BIGINT,   -- Total order amount
    total_order BIGINT,   -- Number of orders
    PRIMARY KEY (summary_date)
) WITH (
    type = 'ots',
    endPoint = 'http://blink-demo.cn-hangzhou.ots-internal.aliyuncs.com',
    instanceName = 'blink-demo',
    tableName = 'result_summary_day',
    column = 'summary_date,total_price,total_order'
);

-- Daily aggregation; columns follow the order declared in result_summary_day
INSERT INTO result_summary_day
SELECT
    pay_day AS summary_date,                    -- Summary date (kept as a yyyy-MM-dd string)
    SUM(CAST(price AS BIGINT)) AS total_price,  -- Total order amount (price is stored as VARCHAR)
    COUNT(id) AS total_order                    -- Number of orders
FROM source_order
GROUP BY pay_day;

Publish the aggregation SQL job, then write order data into the Table Store source table. You can see the daily order counts in result_summary_day being updated continuously, and the big-screen display system can read directly from result_summary_day.
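To sanity-check the daily aggregation logic outside Blink, the same GROUP BY can be run once over a few sample rows in SQLite from Python. This is only a sketch under stated assumptions: SQLite stands in for the engine, its TEXT and INTEGER types replace VARCHAR and BIGINT, the query runs once rather than continuously, and the order rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_order ("
    "id TEXT PRIMARY KEY, restaurant_id TEXT, customer_id TEXT, "
    "city TEXT, price TEXT, pay_day TEXT)"
)

# Made-up sample orders; price is a string, as in the source table definition.
sample_orders = [
    ("o1", "r1", "c1", "Hangzhou", "30", "2019-05-01"),
    ("o2", "r1", "c2", "Hangzhou", "20", "2019-05-01"),
    ("o3", "r2", "c1", "Beijing", "15", "2019-05-02"),
]
conn.executemany("INSERT INTO source_order VALUES (?, ?, ?, ?, ?, ?)", sample_orders)

# One summary row per pay_day: date, total amount, number of orders.
daily_summary = conn.execute(
    "SELECT pay_day AS summary_date, "
    "SUM(CAST(price AS INTEGER)) AS total_price, "
    "COUNT(id) AS total_order "
    "FROM source_order GROUP BY pay_day ORDER BY pay_day"
).fetchall()
conn.close()
print(daily_summary)
```

In the streaming job, each new order for a given pay_day causes the row with that summary_date to be rewritten with the new totals, which is why the result table is keyed by summary_date.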

Summary

  1. Powerful storage and computing engines. In addition to mass storage and extremely high read and write performance, Table Store provides a variety of data analysis features, such as SearchIndex, Secondary Index, and Tunnel Service, which give it clear advantages over open-source solutions such as HBase. The key performance metrics of Blink are 3 to 4 times those of open-source Flink, and data computing latency is optimized to the second level or even the sub-second level.
  2. Fully managed services. Both Table Store and Blink are fully managed serverless services that work out of the box.
  3. Low labor and resource costs. The services that the architecture relies on are all serverless, so they require no O&M and are billed pay-as-you-go, which avoids the cost impact of traffic peaks and troughs.

As an introductory piece, this article mainly presents the advantages of a big data architecture built on Table Store and Blink, together with a simple SQL demonstration. Articles covering more complex, scenario-specific cases will follow.
