Four Billion Records per Second! What is Behind Alibaba Double 11 — Flink Stream-Batch Unification Practice during Double 11 for the Very First Time
By Wang Feng (Mowen)
Released by Apache Flink Community China
During this year’s Double 11 Global Shopping Festival, the peak traffic processing rate of Alibaba Cloud Realtime Compute for Apache Flink reached four billion records per second, and the data volume reached seven TB per second. This means that Flink-based stream-batch unification, still in its early days, has withstood strict tests of stability, performance, and efficiency in Alibaba’s core data service scenarios. This article shares that practical experience and reviews the evolution of stream-batch unification within Alibaba’s core data services.
As Double 11 ended at midnight on November 12, the Gross Merchandise Volume (GMV) of the 2020 Double 11 Global Shopping Festival stood at US$74.1 billion. With the support of Flink, the GMV figure was displayed steadily throughout the festival. The Flink-based real-time computing platform of Alibaba also successfully performed real-time data processing for the Alibaba economy. Once again, Alibaba passed the annual test.
In addition to the GMV dashboard, Flink also provided support for many other critical services. These services include real-time machine learning for search and recommendation, real-time advertisement anti-fraud, real-time tracking and feedback of Cainiao’s order status, ECS real-time attack detection, and monitoring and alerting for massive infrastructures. The real-time business and data volume are increasing dramatically every year. This year, the peak rate of real-time computing reached four billion records per second, and the data volume reached an astonishing seven TB per second. This is equivalent to reading five million copies of the Xinhua Dictionary in one second.
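As a quick sanity check on these figures, the short calculation below derives the average record size they imply, along with the dictionary size implied by the comparison. It assumes decimal units (1 TB = 10^12 bytes), which the article does not specify, so the results are rough estimates only.

```python
# Back-of-the-envelope check of the peak figures quoted above.
# Assumption: decimal units, 1 TB = 10**12 bytes (not stated in the article).
PEAK_RECORDS_PER_SEC = 4_000_000_000    # four billion records per second
PEAK_BYTES_PER_SEC = 7 * 10**12         # seven TB per second
DICTIONARY_COPIES_PER_SEC = 5_000_000   # the "five million copies" comparison

# Average payload carried by one record at peak.
avg_record_bytes = PEAK_BYTES_PER_SEC / PEAK_RECORDS_PER_SEC

# Size of one dictionary copy implied by the comparison.
bytes_per_dictionary = PEAK_BYTES_PER_SEC / DICTIONARY_COPIES_PER_SEC

print(f"average record size: {avg_record_bytes:.0f} bytes")
print(f"implied size of one dictionary copy: {bytes_per_dictionary / 10**6:.1f} MB")
```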
So far, the number of Alibaba Cloud’s real-time computing jobs has exceeded 35,000, and the cluster computing scale has exceeded 1.5 million CPUs, leading the way worldwide. Flink now supports all the real-time computing needs of the Alibaba economy and provides valuable insights to customers, merchants, and operations staff.
In addition, Flink put the unification of stream and batch processing into practice during this year’s Double 11 for the first time, and it withstood strict tests of stability, performance, and efficiency in Alibaba’s core business scenarios.
The First Practice of Stream-Batch Unification in Alibaba’s Core Data Scenarios
The unification of stream and batch processing started at Alibaba a long time ago. Flink was first used at Alibaba in the search and recommendation scenario, where index building and feature engineering were based on the initial version of unified stream and batch processing in Flink. During this year’s Double 11, Flink extended its stream and batch unification capability to help the Alibaba data platform achieve more accurate data analysis and business decision-making by cross-verifying real-time and offline data.
Alibaba offers two types of data reports: real-time reports and offline reports. The former plays an important role in the Double 11 promotion scenario. It provides real-time information in various dimensions for merchants, operations staff, and managers, and helps them make timely decisions to improve the efficiency of the platform and the business. Take real-time marketing data analysis as an example. Operations staff and decision-makers need to compare the results of a big promotion across different periods, such as the turnover at 10:00 am on the promotion day compared with the turnover at the same time the day before. Through this comparison, they can assess the current marketing effect and decide whether any adjustment is necessary.
In the preceding scenario, two kinds of data analysis reports are required. One is the offline data report calculated by batch processing every night. The other is the real-time data report generated by stream processing. Decisions are made by comparing and analyzing the real-time and historical data. Without unification of stream and batch processing, offline reports and real-time reports are generated by a batch engine and a streaming engine separately. As a result, development costs are doubled. It is also difficult to keep the data processing logic consistent across two separate engines, which makes it difficult to ensure that the results are consistent. Therefore, the ideal solution is to use one engine to unify stream and batch processing for data analysis, so that offline and real-time analysis results are naturally consistent. Flink has continuously matured its stream and batch unification architecture and successfully applied it in search and recommendation scenarios. Hence, the Data Platform Team at Alibaba showed firm confidence and trust in Flink during the 2020 Double 11 Global Shopping Festival. The team worked with the Flink Team to promote the technical upgrade of the real-time computing platform. For the first time, Flink stream and batch unification was successfully deployed in the core data scenarios of Double 11.
This year, the stream and batch unification computing framework, jointly developed by the Flink Team and the Data Platform Team, made its debut in the Double 11 core data scenarios. It was also recognized by Peng Xinyu, Head of Alibaba’s Data Platform, because of its performance on the business layer. Technically, as a result of stream and batch unification, only one set of code is required for multiple computing processing modes. In addition, the computing speed after unification is twice as fast as other frameworks, and querying is four times faster. Thus, the speed of developing and generating data reports increases four to ten times. Moreover, the full consistency of real-time and offline data is achieved as a result of unification.
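The idea of “one set of code for multiple processing modes” can be illustrated with a minimal sketch in plain Python. This is not Flink code and not Alibaba’s implementation. The record shape `(hour, amount)`, the function names, and the sample data are all hypothetical. The point is only that when the offline report and the real-time report share a single aggregation routine, their results cannot drift apart because of diverging logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Order = Tuple[int, float]  # (hour of day, order amount); hypothetical record shape


def turnover_by_hour(orders: Iterable[Order]) -> Iterator[Dict[int, float]]:
    """Single piece of business logic: running turnover per hour.

    Yields a snapshot of the totals after every record, so a streaming
    caller can read results continuously and a batch caller can simply
    keep the last snapshot.
    """
    totals: Dict[int, float] = defaultdict(float)
    for hour, amount in orders:
        totals[hour] += amount
        yield dict(totals)


def run_batch(orders: Iterable[Order]) -> Dict[int, float]:
    """Batch mode: consume the bounded input and return the final totals."""
    result: Dict[int, float] = {}
    for result in turnover_by_hour(orders):
        pass
    return result


yesterday = [(9, 120.0), (10, 80.0), (10, 40.0)]          # historical, bounded data
today_live = iter([(9, 150.0), (10, 90.0), (10, 60.0)])   # stands in for a live feed

offline_report = run_batch(yesterday)

realtime_report: Dict[int, float] = {}
for realtime_report in turnover_by_hour(today_live):      # stream mode: updated per record
    pass

# Compare the 10:00 turnover today against the same hour yesterday.
change = realtime_report[10] - offline_report[10]
print(f"10:00 turnover: today {realtime_report[10]}, "
      f"yesterday {offline_report[10]}, change {change:+}")
```

With two separate engines, `turnover_by_hour` would have to be written and maintained twice, and any difference between the two copies would surface as a mismatch between the reports.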
In addition to the advancement in business development efficiency and computing performance, the unified stream and batch computing architecture improves cluster resource utilization. After rapid expansion in recent years, Alibaba’s Flink real-time cluster now contains millions of CPUs, with tens of thousands of real-time computing tasks running. During the day, computing resources are occupied for real-time data processing. At night, the idle computing resources can be used for offline batch processing for free. The batch processing and the stream processing use the same engine and resources, significantly saving development, O&M, and resource costs. During this year’s Double 11, the Flink-based batch and streaming applications did not apply for any additional resources. The batch mode reused the Flink real-time computing cluster, which greatly improved the cluster utilization and saved a large amount of resource overhead. This efficient resource utilization mode also served as the basis for more subsequent business innovations.
Flink Stream-Batch Unification with Great Investment and Efforts
Next, let’s review the development of “stream-batch unification” from a technical perspective. Over ten years ago, Hadoop emerged as the first generation of open-source big data technology. MapReduce, the first-generation batch processing technology, solved the problem of large-scale data processing, and Hive allowed users to use SQL for large-scale data computing. However, as big data business scenarios gradually evolved, many applications, such as social media, e-commerce transactions, and financial risk control, had increasing demands for real-time data. In this context, Storm emerged as the first-generation big data stream processing technology. Storm’s architecture is completely different from that of Hadoop and Hive: its computing model is based on messages and can process massive data concurrently with millisecond-level latency. Storm solved the latency problem of MapReduce and Hive. As a result, batch processing and stream processing were served by two different mainstream engines with completely different computing models. This is called the first era of big data processing technology.
Later, big data processing technology stepped into its second era when Spark and Flink emerged. Compared with Hadoop and Hive, Spark provided better batch processing capability and better performance. This allowed the Spark community to develop rapidly and gradually surpass Hadoop and Hive, and Spark became the mainstream technology in the field of batch processing. However, Spark did not stop at batch processing. It soon launched a stream computing solution, called Spark Streaming, and continued to improve it. Still, the Spark engine is “batch processing-oriented” rather than a pure stream computing engine, so it cannot deliver a unified stream and batch experience when extremely short latency is required. Nevertheless, achieving both stream and batch computing semantics on a single set of core engine technologies was an advanced idea. Flink is another engine built on the same concept of stream and batch unification. Flink was officially released a bit later than Spark. Its predecessor was Stratosphere, a research project started at the Technical University of Berlin in 2009. Flink also aimed to support both batch and stream processing with one computing engine, but it chose a different model. Flink adopted a “stream processing-oriented” engine architecture and treated a “batch” as a “bounded stream.” Therefore, it was more natural to unify stream and batch processing on the stream-oriented engine, with no architectural bottleneck. In other words, Flink chose the “batch on streaming” architecture, which is different from Spark’s “streaming on batch” architecture.
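The difference between the two models can be sketched with a toy example in plain Python. It is a conceptual illustration only and does not reflect the internals of either Flink or Spark. The function names and the batch size are hypothetical. A stream-first operator updates its result after every record, so a bounded input is simply a stream that happens to end. A batch-first operator buffers records and emits only at batch boundaries, which puts a floor under its latency.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def running_sum(records: Iterable[int]) -> Iterator[int]:
    """Stream-first operator: emits an updated result after every record."""
    total = 0
    for value in records:
        total += value
        yield total


def micro_batch_sum(records: Iterable[int], batch_size: int) -> Iterator[int]:
    """Batch-first alternative: buffers records and emits once per full batch."""
    total = 0
    batch: List[int] = []
    for value in records:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            batch.clear()
            yield total
    if batch:  # flush the final partial batch of a bounded input
        yield total + sum(batch)


bounded = [1, 2, 3, 4, 5]   # a "batch" is just a stream that ends
unbounded = count(1)        # 1, 2, 3, ... never ends

# "Batch on streaming": run the streaming operator to the end, keep the last value.
batch_result = list(running_sum(bounded))[-1]

# The same operator on an unbounded source produces a result per record.
first_stream_outputs = list(islice(running_sum(unbounded), 5))

# "Streaming on batch": results appear only when a batch closes.
micro_batch_outputs = list(micro_batch_sum(bounded, batch_size=2))

print(batch_result, first_stream_outputs, micro_batch_outputs)
```

Both approaches reach the same final total on bounded input. They differ in how often intermediate results become visible, which is where the latency gap described above comes from.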
The complete unification of stream and batch architecture in Flink was not built overnight. In earlier Flink versions, stream and batch processing were not completely unified in terms of either the API or the runtime. Since Flink 1.9, Flink has accelerated the improvement and upgrade of stream-batch unification. Flink SQL, the most mainstream API for users, took the lead in achieving stream-batch unification semantics. This allows users to use only one set of SQL statements for stream and batch pipeline development, significantly reducing development costs.
However, SQL cannot meet all user requirements. For some tasks requiring a high degree of customization, such as fine-grained manipulation on states, users still need to use DataStream API. In common business scenarios, after submitting a streaming job, users will often create another batch job to replay historical data. Although DataStream can effectively meet various requirements of stream computing scenarios, it does not provide efficient support for batch processing.
Therefore, since Flink 1.11, the Flink community has focused on improving the stream and batch unification capability of DataStream by adding batch processing semantics to the DataStream APIs. By applying the concept of stream and batch unification to the design of Connectors, Flink can connect the DataStream APIs to different types of stream and batch data sources, such as Kafka and HDFS. In the future, unified iterative APIs will be introduced into the DataStream APIs as well for machine learning scenarios.
From a functionality point of view, Flink is still a combination of stream computing and batch computing, in both SQL and the DataStream APIs. Users’ code is executed either in stream mode or in batch mode. However, some business scenarios place higher requirements on stream-batch unification and need automatic switching between stream computing and batch computing. For example, in data integration and data lake scenarios, the full set of data in the database needs to be synchronized to HDFS or cloud storage services first. Then, incremental data in the database needs to be synchronized automatically. Unified stream and batch ETL processing is performed during such synchronization. Flink will support more intelligent stream-batch unification scenarios in the future.
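The full-then-incremental synchronization pattern described above can be sketched in a few lines of plain Python. This is a conceptual toy, not a Flink API. The change format `(operation, key, value)`, the function names, and the sample data are hypothetical. The bounded snapshot phase and the unbounded change log phase are chained into one input, so a single ETL routine handles both and the switch between them needs no manual step.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, Tuple

Change = Tuple[str, str, int]  # (operation, key, value); hypothetical change format


def snapshot_as_changes(table: Dict[str, int]) -> Iterator[Change]:
    """Bounded phase: replay every existing row as an upsert."""
    for key, value in table.items():
        yield ("upsert", key, value)


def sync(changes: Iterable[Change], sink: Dict[str, int]) -> None:
    """One ETL routine applied to both the full and the incremental phase."""
    for op, key, value in changes:
        if op == "delete":
            sink.pop(key, None)
        else:
            sink[key] = value


source_table = {"a": 1, "b": 2}   # full data set at the time of the snapshot
change_log = [                     # stands in for an unbounded stream of later changes
    ("upsert", "b", 20),
    ("upsert", "c", 3),
    ("delete", "a", 0),
]

sink: Dict[str, int] = {}
# Full data first, then incremental changes, through the same code path.
sync(chain(snapshot_as_changes(source_table), change_log), sink)
print(sink)
```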
The Development of Flink-Based Stream-Batch Unification at Alibaba
Alibaba was the first enterprise in China to choose Flink. In 2015, the Search and Recommendation Team wanted to choose a new big data computing engine to meet the challenges of the next 5 to 10 years. The new computing engine would help process massive item and user data in the search and recommendation backend. Considering the high requirement for short latency in the e-commerce industry, the team hoped the new computing engine would be capable of both large-scale batch processing and millisecond-level real-time processing. In other words, the engine should be a stream-batch unified engine. At that time, Spark’s ecosystem had matured, and it provided a stream-batch unification capability through Spark Streaming. Flink had become a top-level Apache project only one year earlier and was still an emerging project. After research and discussion about Spark and Flink, the team agreed that although the Flink ecosystem was not mature at that time, its stream processing-based architecture was more suitable for stream-batch unification. Therefore, the team quickly decided to build a real-time computing platform for search and recommendation on a version of Flink improved and optimized internally at Alibaba.
After one year of hard work by the team, the Flink-based real-time computing platform for search and recommendation successfully supported the Double 11 Global Shopping Festival in 2016. Alibaba gained an understanding of the Flink-based real-time computing engine through its practice in core business scenarios. As a result, all of Alibaba’s real-time data services were migrated to the platform. After another year of hard work, Flink successfully supported the real-time data services of Alibaba during the 2017 Double 11 Global Shopping Festival, including the GMV dashboard and other core data service scenarios.
In 2018, Alibaba Cloud launched a real-time computing product based on Flink to provide cloud computing services for small and medium-sized enterprises. Alibaba wanted not only to use Flink to solve its own business problems but also to promote the development of the Flink open-source community and make more contributions to it. Alibaba acquired Ververica, the company founded by the creators of Flink, in early 2019 and began to invest more resources in the Flink ecosystem and community. By 2020, almost all mainstream technology companies globally had adopted Flink for real-time computing. Flink has become a de facto standard for real-time computing in the big data industry.
The Flink community continues to conduct technical innovation. During the 2020 Double 11 Global Shopping Festival, the Flink-based stream-batch unification performed remarkably in the core marketing decision-making system of Tmall. In addition, Flink-based stream-batch unification successfully completed the stream-batch indexing and machine learning processes in search and recommendation scenarios. These facts proved the decision to choose Flink five years ago was a good one. In the future, we believe more companies will adopt Flink-based stream and batch unification.
The Technical Innovation of Stream-Batch Unification Promotes the Development of the Flink Open-Source Community
The continuous technical innovation of Flink-based stream-batch unification has promoted the rapid development of the Flink open-source community and the Flink ecosystem. As more companies in China adopt Flink, the Flink community in China is growing and becoming the main community worldwide.
The most obvious sign is the increasing number of users. Since June 2020, Flink’s Chinese mailing list has been more active than the English mailing list. The influx of users and developers into the Flink community has also brought in more high-quality code contributors, effectively facilitating the development and iteration of the Flink engine.
The number of contributors in each Flink version has been increasing since Apache Flink 1.8.0. Most contributors come from major enterprises in China. Undoubtedly, developers and users in China are gradually becoming the backbone in developing Flink.
As the Chinese community continues to grow, Flink is more active in 2020 than ever before. In the 2020 Fiscal Year Report of the Apache Software Foundation, Flink is the most active project of the year in terms of user and developer mailing list activity. Flink also ranks second in terms of the number of code commits and the traffic on its GitHub homepage. It is not easy to achieve such results among nearly 350 top-level projects at the Apache Software Foundation.
Flink Forward Asia 2020: Unveiling Technologies behind Stream-Batch Unification
Flink Forward is the Flink technology conference authorized by Apache. This year, the Flink Forward Asia (FFA) Conference will be broadcast live online so that developers can learn about open-source big data technology for free. Many leading Internet companies worldwide, including Alibaba, Ant Group, Tencent, ByteDance, Meituan, Xiaomi, Kuaishou, Bilibili, NetEase, Weibo, Intel, DellEMC, and LinkedIn, will share their technical practices and innovations with Flink. Developers can learn directly from these companies.
Stream-batch unification will also be a hot topic at the FFA Conference this year. The Head of Data Technology from Tmall will share the practice and implementation of Flink-based stream-batch unification at Alibaba. Attendees will learn how stream-batch unification creates business value in the core scenarios of Double 11. Flink PMC members and committers from Alibaba and ByteDance will give in-depth technical talks on the SQL and Runtime layers of Flink-based stream-batch unification. They will also share the latest technical progress of the Flink community. Gaming technology experts from Tencent will introduce application practices of Flink in the game “Honor of Kings.” The Director of Real-Time Big Data from Meituan will explain how Flink helps life service scenarios work in real time. The Head of Big Data from Kuaishou will share the development of Flink at Kuaishou. Machine learning technology experts from Weibo will introduce how to use Flink for information recommendation. In addition, Flink-related topics will cover finance, banking, logistics, automobile manufacturing, travel, and many other industries, presenting a flourishing ecosystem. Developers who are interested in open-source big data technologies are welcome to attend the FFA Conference, where they can learn more about the latest technological developments and innovations in the Flink community.
Official website of the conference: https://www.flink-forward.org/