Meituan-Dianping’s Use of Flink-based Real-time Data Warehouse Platforms

  1. Meituan-Dianping’s real-time computing evolution and business practices
  2. Flink-based real-time data warehouse platform
  3. Future development and considerations

1) Meituan-Dianping’s Real-time Computing Evolution and Business Practices

Evolution of Real-time Computing at Meituan-Dianping

In 2016, Meituan-Dianping already had an initial real-time computing platform based on Apache Storm. In early 2017, it introduced Spark Streaming for specific scenarios, mainly data synchronization. At the end of 2017, it introduced Flink on its real-time computing platform, which offered many advantages over Apache Storm and Spark Streaming. At this stage, Meituan-Dianping restructured its system in depth as a platform, focusing on security, stability, and ease of use. Since 2019, Meituan-Dianping has been committed to providing real-time data warehouse and machine learning solutions to better support its businesses.

Real-time Computing Platform

At present, Meituan-Dianping’s real-time computing platform runs on thousands of machines and hosts tens of thousands of active jobs every day. It processes 150 million messages per second during peak hours and serves thousands of users.

Real-time Computing Platform Architecture

The following figure shows the real-time computing architecture of Meituan-Dianping.

  • Job release includes version management, compilation, release, and rollback.
  • Job status includes the runtime status, custom metrics and alerts, and command and runtime logs.

Business Data Warehouse Practices

Traffic

  • In terms of data sources, traffic data warehouses obtain data from logs, business data warehouses obtain data from business binary logs, and feature data warehouses obtain data from various sources.
  • In terms of data volume, the data volume of traffic and feature data warehouses is massive, with tens of billions of data records every day, while the data volume of business data warehouses is millions to tens of millions every day.
  • In terms of data update frequency, traffic data is rarely updated while business and feature data is updated more often. We focus on time series and trends of traffic data, and status changes of business data and feature data.
  • In terms of data accuracy, traffic data can be less accurate while business data and feature data must be highly accurate.
  • In terms of model adjustment frequency, business data models are adjusted frequently while traffic data and feature data models are adjusted rarely.

2) Flink-based Real-time Data Warehouse Platform

This section describes the evolution of our real-time data warehouses and the development ideas behind Meituan-Dianping’s real-time data warehouse platform.

Offline Data Warehouse Model

To organize and manage data more effectively, the offline data warehouse model is divided into the operational data store (ODS) layer, data warehouse detail (DWD) layer, data warehouse service (DWS) layer, and application layer from bottom up. Ad hoc queries are implemented through Presto, Hive, and Spark.

Real-time Data Warehouse Model

The real-time data warehouse model is also divided into the ODS layer, DWD layer, DWS layer, and application layer. However, it differs from the offline data warehouse model in how data is stored and processed. Data at the DWD and DWS layers is stored in Apache Kafka. Dimension data is stored in key-value stores, such as HBase and Tair, to improve performance. Ad hoc queries can be performed by using Flink.
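The dimension lookup described above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the dictionary `DIM_SHOP` is a hypothetical in-memory stand-in for a key-value store such as HBase or Tair, and the event fields (`order_id`, `shop_id`, `amount`) are invented for the example.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical in-memory stand-in for a key-value store such as HBase or Tair.
DIM_SHOP: Dict[str, Dict[str, str]] = {
    "s1": {"city": "Beijing", "category": "food"},
    "s2": {"city": "Shanghai", "category": "hotel"},
}


def enrich(events: Iterable[dict], dim: Dict[str, Dict[str, str]]) -> Iterator[dict]:
    """Join each fact event with its dimension row by key (a lookup join)."""
    for event in events:
        # A missing key contributes no extra attributes instead of failing.
        attrs = dim.get(event["shop_id"], {})
        yield {**event, **attrs}


# Hypothetical DWD-layer events, as they might be read from a Kafka topic.
dwd_orders = [
    {"order_id": 1, "shop_id": "s1", "amount": 30.0},
    {"order_id": 2, "shop_id": "s2", "amount": 120.0},
    {"order_id": 3, "shop_id": "s9", "amount": 15.0},
]

enriched = list(enrich(dwd_orders, DIM_SHOP))
for row in enriched:
    print(row)
```

Keeping dimension data in a key-value store means each event costs one point lookup, which is why this layout suits high-throughput streams better than a scan-based join.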

Quasi-real-time Data Warehouse Model

Service providers can also use a quasi-real-time data warehouse model, which is not completely based on streams. Instead, data at its DWD layer is imported to Online Analytical Processing (OLAP) and then summarized and further processed based on the OLAP computing capability.
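As a rough sketch of the quasi-real-time idea, the example below loads DWD-layer detail rows into a queryable store in small batches and computes the summary at query time. SQLite is used only as a runnable stand-in for an OLAP engine, and the table name `dwd_orders` and its columns are assumptions made for illustration.

```python
import sqlite3

# SQLite stands in for an OLAP engine here; this is an assumption for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dwd_orders ("
    "order_id INTEGER PRIMARY KEY, city TEXT, amount REAL, minute TEXT)"
)


def load_batch(rows):
    """Import a micro-batch of DWD detail rows; a repeated order_id overwrites the old row."""
    conn.executemany("INSERT OR REPLACE INTO dwd_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def summarize():
    """Summarize at query time instead of maintaining a streaming aggregate."""
    cur = conn.execute(
        "SELECT city, COUNT(*), SUM(amount) FROM dwd_orders "
        "GROUP BY city ORDER BY city"
    )
    return cur.fetchall()


load_batch([(1, "Beijing", 30.0, "12:00"), (2, "Shanghai", 120.0, "12:00")])
# The second batch adds order 3 and corrects the amount of order 1.
load_batch([(3, "Beijing", 15.0, "12:01"), (1, "Beijing", 35.0, "12:01")])

summary = summarize()
print(summary)
```

Because the aggregate is recomputed from the detail rows on every query, a corrected record is reflected automatically, and changing the summary logic only requires changing the query.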

Comparison between Real-time Data Warehouses and Offline Data Warehouses

Real-time data warehouses and offline data warehouses can be compared in the following aspects:

  • In terms of fact data storage, offline data warehouses store fact data in HDFS, while real-time data warehouses store fact data in message queues, such as Apache Kafka.
  • In terms of dimension data storage, real-time data warehouses store dimension data in key-value stores.
  • In terms of data processing, offline data warehouses use batch processing components such as Hive and Spark, while real-time data warehouses use real-time computing engines, such as Apache Storm and Flink, for stream processing.

Comparison between Real-time Data Warehouse Construction Solutions

The following figure compares the construction solutions for quasi-real-time data warehouses and real-time data warehouses. These solutions are implemented based on OLAP engines or stream computing engines and update data within minutes or seconds respectively.

  • In terms of business flexibility, quasi-real-time data warehouses use OLAP engines, which are more flexible than the stream computing of real-time data warehouses.
  • In terms of data latency tolerance, quasi-real-time data warehouses can perform full computing on data in a given period, while real-time data warehouses perform incremental computing on data. Therefore, the former has a higher tolerance for data latency.
  • In terms of scalability, quasi-real-time data warehouses integrate data computing and storage capabilities, making them less scalable than real-time data warehouses.
  • In terms of application scenarios, quasi-real-time data warehouses are used in scenarios with moderate timeliness requirements, small data volumes, complex multi-table association, and frequent business changes, such as real-time analysis of transaction data. Real-time data warehouses are more suitable for scenarios with high timeliness requirements and large data volumes, such as real-time features, traffic distribution, and real-time analysis of traffic data.
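The difference in latency tolerance can be illustrated with a small sketch. It is deliberately simplified and makes several assumptions: events are `(minute, amount)` pairs, a window is closed as soon as an event for a later minute arrives, and late events for a closed window are dropped. Real stream engines offer mechanisms for handling late data that this sketch ignores.

```python
from collections import defaultdict

# Events in arrival order as (event_minute, amount); the last one is late for minute 0.
events = [(0, 10), (0, 20), (1, 5), (0, 7)]


def full_recompute(stored):
    """Quasi-real-time style: recompute totals over all stored rows at query time."""
    totals = defaultdict(int)
    for minute, amount in stored:
        totals[minute] += amount
    return dict(totals)


class IncrementalSum:
    """Stream style: keep running totals and finalize a window once a later one starts."""

    def __init__(self):
        self.open_windows = defaultdict(int)
        self.emitted = {}
        self.dropped = []

    def on_event(self, minute, amount):
        if minute in self.emitted:
            # The window was already finalized, so this late event is not counted.
            self.dropped.append((minute, amount))
            return
        # An event for a newer window closes every older open window.
        for m in [m for m in self.open_windows if m < minute]:
            self.emitted[m] = self.open_windows.pop(m)
        self.open_windows[minute] += amount

    def results(self):
        return {**self.emitted, **self.open_windows}


incremental = IncrementalSum()
for minute, amount in events:
    incremental.on_event(minute, amount)

full_result = full_recompute(events)
incremental_result = incremental.results()
print(full_result, incremental_result, incremental.dropped)
```

The full recomputation includes the late event because it reads all stored rows each time, while the incremental version has already finalized minute 0 and misses it unless extra late-data handling is added.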

All-in-one Solution

Through business practices, we discovered that the metadata of different businesses was kept separately, and that business developers tended to use SQL to develop both offline data warehouses and real-time data warehouses, which required more O&M tools. Therefore, we planned an all-in-one solution to streamline the entire process.

Why Use Flink?

We use Flink to build the real-time data warehouse platform because it covers the following capabilities, which are also the platform's main concerns:

  • Expression: Flink provides a wide range of multi-layer APIs, including the Stream API,
  • Batch and stream processing: Flink supports both batch processing and stream processing.

Real-time Data Warehouse Platform

Construction Ideas

  • UDF quality: The platform manages templates, cases, and tests for users, shields users from the compiling, packaging, and JAR package management processes, and performs metric log tracking and exception handling in the UDF template.
  • UDF reuse capabilities: The UDFs developed by a service provider may be used by other service providers, but incompatibility may occur in the upgrade process. Therefore, the platform provides project, function, and version management for businesses.
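One way to picture project, function, and version management for UDFs is a registry keyed by all three. The sketch below is hypothetical: the class `UdfRegistry`, its methods, and the sample project and function names are invented for illustration and do not describe the platform's actual interface.

```python
from typing import Callable, Dict, Optional, Tuple


class UdfRegistry:
    """Hypothetical registry keyed by (project, function name, version)."""

    def __init__(self) -> None:
        self._udfs: Dict[Tuple[str, str, int], Callable] = {}

    def register(self, project: str, name: str, version: int, fn: Callable) -> None:
        key = (project, name, version)
        if key in self._udfs:
            # Published versions are immutable so existing users are never surprised.
            raise ValueError(f"{project}.{name} v{version} is already registered")
        self._udfs[key] = fn

    def resolve(self, project: str, name: str, version: Optional[int] = None) -> Callable:
        """Return a pinned version, or the highest registered version if none is given."""
        if version is None:
            versions = [v for (p, n, v) in self._udfs if p == project and n == name]
            if not versions:
                raise KeyError(f"{project}.{name} is not registered")
            version = max(versions)
        return self._udfs[(project, name, version)]


registry = UdfRegistry()
registry.register("search", "normalize_city", 1, str.lower)
registry.register("search", "normalize_city", 2, lambda s: s.strip().lower())

pinned = registry.resolve("search", "normalize_city", version=1)
latest = registry.resolve("search", "normalize_city")
print(repr(pinned(" Beijing ")), repr(latest(" Beijing ")))
```

A job that pins version 1 keeps its old behavior after version 2 is published, which is the kind of upgrade incompatibility that version management is meant to contain.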

3) Future Development and Considerations

Automatic Resource Optimization

Meituan-Dianping’s real-time computing platform has thousands of nodes and may expand to tens of thousands of nodes in the future. This creates new requirements for resource optimization. A real-time task may require many resources during peak hours but far fewer resources during off-peak hours.
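A simple way to reason about this is to size a job from its current input rate. The sketch below is only an illustration under stated assumptions: the function `target_parallelism`, the per-slot capacity, the headroom factor, and the bounds are all hypothetical and are not taken from the platform.

```python
import math


def target_parallelism(
    input_rate: float,
    per_slot_capacity: float,
    headroom: float = 0.2,
    min_parallelism: int = 1,
    max_parallelism: int = 128,
) -> int:
    """Estimate parallelism from the input rate, with headroom, clamped to bounds."""
    needed = input_rate * (1 + headroom) / per_slot_capacity
    return max(min_parallelism, min(max_parallelism, math.ceil(needed)))


# Hypothetical rates in messages per second; each slot is assumed to handle 10,000.
peak = target_parallelism(45_000, 10_000)
off_peak = target_parallelism(4_000, 10_000)
capped = target_parallelism(2_000_000, 10_000)
print(peak, off_peak, capped)
```

Recomputing the target periodically and releasing the difference during off-peak hours is what allows the same cluster to host more jobs.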

Upgrading the Real-time Data Warehouse Construction Process

The construction of a real-time data warehouse includes the following steps.

  • Implement intelligent modeling based on business requirements to design an automatic modeling solution.
