Progress and Prospects of Alibaba’s Big Data Technology

14 min readMay 14, 2020

During the session on big data technology at the 2019 Apsara Conference held in Hangzhou, Guan Tao, Senior Director of Alibaba Cloud Computing Platforms, delivered a speech titled “Progress and Prospects of Alibaba’s Big Data Technology.” This article describes the shift of customer value in the big data domain from the perspective of Alibaba, and then summarizes the development focuses of core technologies. Finally, it describes the work involved in building an intelligent big data platform, from engine optimization to “autonomous driving,” with several typical cases.

The following introduces highlights from Guan Tao’s lecture.

1. The Shift of Value Proposition across the Big Data Domain

The 10-year History of Big Data: from “Early Adoption” to “Mainstreaming”

Big data technology has been around for 20 years, and Alibaba’s Apsara platform has a history of 10 years. The preceding figure shows the Hype Cycle for Emerging Technologies published by Gartner, a world-famous research and advisory company. Emerging technologies refer to technological innovations seen in a certain period of time. The horizontal axis depicts the progression of an innovation in five stages, from its trigger and initial over-enthusiasm, through a period of disillusionment to continuous development. Each innovation on the Hype Cycle is assigned to a category that represents how long it will take to reach the plateau of productivity from its current position. In 2014, big data reached the end of the peak of inflated expectations. In 2015, Gartner retired big data from the hype cycle and many people were arguing about where big data should be positioned. The Gartner analyst Betsy Burton concluded, “Big Data…has become prevalent in our lives.” What she means is that big data is no longer a specific technology, but a comprehensive technological domain with a wide range of common applications. Alibaba is of the opinion that big data moved from the early adoption stage to the mainstreaming stage around 2014 and has since driven many changes in value proposition.

The Shift of Value Proposition across the Big Data Domain

The preceding figure shows a comparison of the early adoption and mainstreaming stages of big data. In the early adoption stage, ease of use is the top priority. Agility is the next thing to concern users. It is very important to quickly make adjustments and modifications to meet business needs when platforms, supporting components, and toolchains are still immature. Finally, the applications of the technology need to achieve certain goals. However, these applications do not have to be comprehensive or even stable. They are acceptable if they can be tested through trial and error. The characteristics of the early adoption stage are very different from and can even be contradictory to those of the mainstreaming stage. In the mainstreaming stage, costs and performance become critical. This is especially true of costs as surveys show that users are very concerned about costs. Users not only want to know how much they will have to pay for big data processing, but also how to control costs in the event of massive data volume growth. In the mainstreaming stage, when applications reach large scales, the ability to offer enterprise-grade services becomes crucial. For example, Alibaba’s big data platform generates Alipay statements for merchants every day. Therefore, the settlement systems between merchants and merchants, merchants and their vendors and clients, and merchants and banks have zero tolerance for errors. The shift from the early adoption stage to the mainstreaming stage requires a rich and complete toolchain and ecosystem and the integration of these two components to achieve a high level of overall performance.

Development History of the Apsara Platform: An Alibaba Perspective

MaxCompute is the underlying system of the Apsara platform and supports most of Apsara’s data storage and computing demands. In partnership with Oracle, Alibaba began to build its data warehouse, mainly for accounting and internal use, in 2002. Four years later, Alibaba had the largest Oracle rack in Asia. In 2008 and 2009, Alibaba launched the Hadoop and Apsara systems respectively, followed by the well-known Moon Landing system. In 2015, the Moon Landing system was completed, all data was brought together, and a data base was established as a unified storage system, an intermediate Unified Computing System, and a data mid-end. The entire system is centered on Data Mid-End and represents the integration of big data within Alibaba. In 2016, the MaxCompute 2.0 project was launched, which almost completely replaced the system used from 2010 to 2015 and began to provide services to domestic cloud computing customers. In 2019, with the launch of MaxCompute 3.0, in addition to performance and costs, more attention is given to the challenge of ultra-large data volumes that have made it impossible for real humans to make optimizations in the big data field. This means it is difficult for mid-end engineers to manually perform modeling and optimization for the mid-end. Alibaba believes that artificial intelligence (AI) is the future and that the ability to optimize big data through AI technology is crucial.

2. Development of Core Technologies

Analyzing the Development of Core Technologies from Four Perspectives

High performance and low costs — This requirement can be met at four levels including computing, storage, resource utilization, and governance.
Enterprise-grade services — This means enterprise-grade stability, scalability, and disaster recovery capabilities.
Ecosystem and standardization — This mainly refers to the integration of ecosystems with standards.
Intelligence

The MaxCompute Big Data Cost Curve

The preceding figure shows the results derived from surveys of hundreds of customers. The yellow curve represents the business growth of the company and departments, and the blue curve represents the process of big data application. In the first year, the curve rises slowly. In the mainstreaming period, development begins to take off when people uncover the technological uses and value of big data. Early in this stage, the curve is not flat but steep, indicating rapid growth.

With this trend comes a problem. The growth rate of data volumes, computing workloads, and financial expenses has outpaced the growth of the existing business and may continue to rise. At this time, an appropriate system, coupled with good optimization and governance measures can slow down the growth of data volumes to bring it into line with applications and development, while ensuring sustainable costs. For example, when the business volume increases five-fold, the costs only double. If you cannot bring down the data size, the data center will become a cost center. In such a case, the value of massive data and computing workloads will not be apparent. To solve this problem, it is necessary to provide better service capabilities with high performance and low costs to reduce the costs at the platform layer and enable better data governance. In addition, big data can also be optimized through intelligent approaches.

Building a High-performance and Cost-effective Computing Platform

Alibaba faces four challenges in building a high-performance and cost-effective computing platform:

Costs will continue to rise after a single cluster can support 10,000 servers.
Hardware investment grows faster than the business due to the explosive growth of data and computing.
The technological development of medium-and large-sized companies has entered uncharted territory beyond the scope of existing open-source software.
Overall utilization stays low with numerous small clusters which can not form big clusters.

To address these challenges, Alibaba has made the following improvements to its computing platform:

Engine optimization: The core engine is completely proprietary and capable of control and continuous optimization.
Storage optimization: We ensure that there is no duplicate data, while also enabling intelligent storage classification (1.6) and compression classification.
Resource optimization: The unified cloud-native resource pool and corresponding load leveling work with offline hybrid deployment. One noteworthy point is that resource optimization is more important than job optimization. Alibaba no longer focuses on job performance and speed, but cares more about increasing overall resource utilization.
Data and computing management and governance.

The preceding figure shows an example of Double 11 from 2015 to 2018. The left shows the daily job volume, the middle shows the daily processing data volume, and the right shows the cost curve. This provides that through the Apsara platform and its technical capabilities, Alibaba has brought its business growth rate into line with its cost growth rate.

In addition, we have made more optimizations:

Engines:

NativeEngine + LLVM CodeGen and Vectorization + SIMD are used.
CBO + HBO and Dynamic DAG are used.
Rich structured data is introduced to better handle massive amounts of input and shuffling data.
Data can be stored by range or hash, and primary index and order are supported.

Storage: Compatible with open-source Apache ORC. With the new C++ writer and improved C++ reader, reading performance is 50% faster than that of CFile2 and open-source ORC.

Resources: A set of computing and scheduling capabilities across clusters. The servers in multiple clusters are aggregated into one computer.

Scheduling system optimization: The average cluster utilization rate is 70%. In addition to single-job metrics, the throughput rate of the overall cluster is emphasized.

Servers: The utilization rate of online servers is increased by more than 50% by using hybrid deployment technology. It also supports business elasticity to cope with Double 11.

The following are some relevant data and cases:

In 2015, MaxCompute won the GraySort (100 TB) category in Sort Benchmark.
In 2016, Elastic MapReduce (EMR) won the CloudSort (100 TB) category in Sort Benchmark.
In 2017, MaxCompute + PAI became the world’s first platform to pass the TPCx-BigBench (100TB) benchmark test.
In 2018, MaxCompute + PAI more than doubled the TPCx-BigBench benchmark and maintained its leadership in this area.
In 2018, Alibaba’s in-house Flink version outperformed the community version several times over. We open-sourced this in-house version in 2019.
In 2019, EMR delivered the world’s highest speed in the TPC-DS (10 TB) benchmark tests.
In 2019, MaxCompute + PAI continued to improve metrics and maintained its world leadership by doubling its performance and halving costs at the 30-TB scale factor.

The preceding figure shows the TPCx-BigBench (100TB) benchmark test results for MaxCompute from 2017 to 2019. As you can see, its QPM basically doubled every year.

As shown in the preceding figure, compared with other systems in the industry, MaxCompute delivers twice the performance at only half the costs.

Building a Versatile Enterprise-grade Computing Platform

The construction of a versatile enterprise-grade computing platform at the system backend can be broken down into four parts:

1. Reliable data joining points (data chassis) are required. Many companies view their data as an asset, and therefore data security is crucial. Specifically, this includes the following content:

Exabyte-level scalability: A single cluster, multiple clusters, and global deployment can all be scaled.
Data reliability: The availability of data is not sufficient, absolute security is required, such as disaster recovery for data centers.
Security: Data security needs to be achieved in every aspect including storage, computing, management, and O&M.

2. Disaster recovery is the responsibility of enterprises themselves. They can achieve certain capabilities by selecting different disaster recovery offerings, including the following:

Cost-effective hardware
Self-service and automated O&M
Comprehensive fault tolerance (software, hardware, network, and human error)

3. Privacy breaches are a growing concern. Despite this, Alibaba has been able to maintain a perfect record in privacy protection due to its high standards for data management, sharing, and security. The following are the specific aspects:

Disaster recovery and backup
Fine-grained authorization, security guards, auditing, and storage encryption
Data governance capabilities, data lineage and tracking, and data lineage-based analysis and reports
Management and scheduling of multiple data and jobs
Scheduling ability based on baseline security assurance

4. The optimization of internal scheduling capability and scalability in the system includes the following:

Mega-scale and unified resource pool
Overselling
Baseline security
Scalability and hybrid deployment

Building a Computing Platform that Integrates Ecosystems

The preceding figure illustrates a case of Apsara and MaxCompute platform integration. The first layer is the unified storage layer, opening both the MaxCompute engine and other engines. The abstraction layer in the middle is the federated computing platform. Federation refers to the abstraction of data, resources, and interfaces into a set of standard interfaces, which can be used by Spark and other engines and form a complete ecosystem. The second line is the external ecosystem of MaxCompute. There are many types of data sources, which exist not only in Alibaba’s own storage, but also in database systems and file systems. In addition, users can include other systems to work without moving data. This is called federated computing.

Previously, Blink was a separate branch of the Flink community. As a system of best development practices within Alibaba, Blink became a default community with version 1.9, contributing to the SQL engine, scheduling system, and Algo on Flink. With the acquisition of a company affiliated with Flink, we will promote the development of Flink.

Finally, there is the development of the storage layer. The preceding figure shows the modifications of compression, read and write, and data-related formats. All the modifications will be pushed to the community. The orange font indicates the modification was made according to design standards.

3. From Engine Optimization to “Autonomous Driving”

Computing engine optimization involves not only the engine itself, but also its “autonomous driving” capability. The preceding figure illustrates the evolution of the Apsara system by using cars as a metaphor. The first phase emphasizes availability, such as whether the system can support the overwhelming workloads of Double 11. In the second phase, higher performance and lower costs are the goals of optimization. When it comes to the third phase, performance becomes the ultimate goal.

Auto Cloud Data Warehouse

Three key challenges have emerged within Alibaba:

It is difficult to manage exabytes of data and millions of jobs. This is beyond the ability of the Data Mid-End team, which means the traditional database administrator (DBA) model can no longer handle such workloads.
When different types of data are aggregated, it is impossible for humans to understand the value of these massive amounts of data.
After many years of big data system development, substantial progresses in big data can be achieved only by transforming the underlying structure of the system.

From the perspective of Auto Cloud Data Warehouse (CDW), three aspects can be optimized. The first aspect is efficiency, including HBO that is based on historical information. This process can be understood in this way: When a new job that the system is not familiar with runs in the system, the system allocates resources in a conservative manner to complete the job. The system tuning tends to be careful the first time, but gradually improves and better adapts to the job. Four days later, the job performance is much better. With HBO optimization, Alibaba’s resource utilization reaches 70%. In addition, learned statistics, reuse of intelligent computing, and intelligent data stratification are also included.

The second aspect is resource planning. When 100,000 machines in the cloud are distributed in different data centers, data planning and resource scheduling are carried out automatically rather than manually. This includes the automatic classification of job running modes, among which three are for very large jobs and highly interactive jobs. In addition, dynamic quota adjustment, scaling, job running prediction and automatic early alerts, automatic job upgrade and downgrade, data arrangement, and cross-cluster scheduling are also included.

The third aspect is intelligent modeling that includes the identification of similar jobs and data, automatic error correction, job running prediction, automatic early alerts, and automatic job upgrade and downgrade.

These three aspects have great development potential in the field of Auto CDW. In the preceding figure, the items followed by * are functionalities soon to be announced by Alibaba.

Auto CDW — Intelligent Index Recommendations

Cost modules are assimilated according to relationships between running jobs so that the adjustment solution with the best index can be found and then pushed. For example, based on MaxCompute, 300,000 fields in 80,000 tables are selected within Alibaba Group, and the best clustering solution for 44,000 tables is recommended, reducing costs by an average of 43%.

Auto Tiered Store — Cold and Hot Data Recognition and Management

On September 1, 2019, Alibaba reduced its overall storage prices by 30%. This was partly due to the Auto Tiered Store technology shown in the preceding figure, which enables automatic separation of hot and cold data. In the past, data was separated by two methods. In the first method, the system automatically performed cold compression, which could reduce costs by about two thirds. The second method was to allow users to add flags. However, when there are tens of millions of tables in the system, it is difficult for data development engineers to identify how the data is used. In this case, economic models can be used to build the relationship between access and storage, and the hot and cold levels can be automatically customized based on different partitions of different jobs. In this way, Alibaba was able to reduce the compression ratio from 3 to 1.6, improving the overall storage efficiency by 20%.

Yugong — Intelligent Global Data Arrangement and Scheduling

A cloud system relies on multiple data centers deployed around the world. However, data generation is related to businesses, and the association among data must not be broken. Therefore, questions, such as in which types of data centers data should be stored and what kinds of jobs should be scheduled to achieve optimal performance, concern optimal matching on a global scale. Internally, Alibaba integrated static job arrangement and dynamic scheduling to develop a system called Yugong. The two diagrams on the right of the preceding figure show the principles of this system.

DPSAaS — Data Sharing and Analytics Services Based on Differential Privacy

The computing of sensitive data is called secret computing. This makes privacy data computable but invisible. In the preceding figure, the first three columns contain sensitive data, whereas the last three contain non-sensitive data. All the sensitive data is masked by using differential privacy and therefore can not be queried, but the computing results are still correct. With this solution, Alibaba is trying to strike a balance between data sharing and privacy protection.

Other Future-oriented Explorations

Alibaba is exploring how to perform operations on graph-based relationships, find the optimal balance between systems, perform privacy-based computing, improve scheduling to meet multiple objectives, and achieve better results while drastically reducing data volume at the sampling layer.