MaxCompute: Serverless Big Data Service with High Reliability

Big Data in Alibaba

First, let’s look at some background on big data technologies at Alibaba. As shown in the following figure, Alibaba began to build out its big data technologies very early, and it is safe to say that Alibaba Cloud was founded to help Alibaba solve technical problems related to big data. Currently, almost all Alibaba business units use big data technologies, which are applied both widely and deeply across the company. Additionally, the whole set of big data systems at Alibaba Group is integrated together.

Overview of Alibaba Cloud’s Computing Platform

The Alibaba Cloud Computing Platform business unit is responsible for the integration of Alibaba big data systems and for R&D related to storage and computing across the whole Alibaba Group. The following figure shows the structure of the Alibaba big data platform. The underlying layer is the unified storage platform, Apsara Distributed File System, which is responsible for storing big data. Storage is static, and computing is required for mining data value. Therefore, the Alibaba big data platform also provides a variety of computing resources, including CPU, GPU, FPGA, and ASIC. To make better use of these computing resources, we need unified resource abstraction and efficient management. The unified resource management system in Alibaba Cloud is called Job Scheduler. Based on this resource management and scheduling system, Alibaba Cloud has developed a variety of computing engines, such as the general-purpose computing engine MaxCompute, the stream computing engine Blink, the machine learning engine PAI, and the graph computing engine Flash. In addition to these computing engines, the big data platform also provides various development environments, on which the implementation of many services is based.

MaxCompute System Architecture

The system architecture of MaxCompute is very similar to those of other big data computing engines in the industry. As shown in the following figure, you can use clients to ingest and transfer data into MaxCompute through the access layer. In the middle is the management layer, which manages a variety of user jobs and hosts many features, such as SQL compilation, optimization, and execution, along with other basic features in the big data field. MaxCompute also provides distributed metadata services; without metadata management, we cannot know what the stored data actually represents. The bottom layer of the MaxCompute architecture is where the actual storage and computing take place.

MaxCompute Serverless Technology

Serverless is currently a very popular concept. Its popularity is both gratifying and frustrating to the developers of MaxCompute. It is gratifying because the MaxCompute team began developing similar capabilities long before the Serverless concept was born, and the team’s design philosophy is fully consistent with Serverless; this shows that MaxCompute got an early start on Serverless and holds certain technology advantages. It is frustrating because although the MaxCompute team put forward ideas similar to Serverless and recognized the value of this technology, it did not package these capabilities as a product early on.

The rest of this article focuses on three topics:

  • How to continuously iterate and upgrade big data services
  • The trend in big data services toward automation and intelligence
  • Data security

Challenges and Solutions for Continuous Improvement and Release

Continuous improvement and release is an essential part of a big data system. A big data platform needs to be continuously upgraded and improved to meet a variety of new user needs. While making these improvements, the platform also has to process the millions of jobs running on it each day and keep services available 24/7 without interruption. How can we keep the platform stable and secure during an upgrade and make sure that the release of new features does not affect users? How can we ensure that the new version is stable, free of bugs, and free of performance regressions? How can we quickly contain the damage after a problem occurs? All these questions need to be taken into consideration. In addition, we need to consider the conflict between testing and data security during continuous improvement and release. In other words, implementing continuous improvement and release on a big data platform is like changing the engine of an airplane in mid-flight.


Playback

Playback was developed to let MaxCompute rapidly improve the expressiveness and optimization level of its compiler and optimizer. After significant improvements are made to the compiler and optimizer, a new challenge arises: ensuring that they remain problem-free. The original method was to run existing SQL statements through the improved compiler and optimizer and then manually analyze the execution results to find problems introduced by the improvements. However, manual analysis is highly time-consuming and therefore unacceptable. So the MaxCompute team set out to implement self-verification of the improved compiler and optimizer by utilizing the big data computing platform itself.
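The playback idea can be sketched as a large-scale diff job: replay a corpus of historical SQL statements through both the old and the new compiler/optimizer and flag only the statements whose plans diverge, so humans review a short list instead of everything. The `compile_old`/`compile_new` functions and the corpus below are purely hypothetical stand-ins that illustrate the mechanism, not MaxCompute internals.

```python
# Hypothetical sketch of "playback" self-verification: replay historical
# SQL through two compiler/optimizer versions and report divergent plans.

def compile_old(sql: str) -> str:
    # Stand-in for the current (baseline) compiler + optimizer.
    return "PLAN(" + sql.strip().lower() + ")"

def compile_new(sql: str) -> str:
    # Stand-in for the improved compiler + optimizer under test.
    plan = "PLAN(" + sql.strip().lower() + ")"
    # Simulate a plan change (intentional or buggy) for one query shape.
    return plan.replace("order by id", "order by id /*merged sort*/")

def playback(corpus):
    """Return the statements whose plans diverge between versions."""
    divergent = []
    for sql in corpus:
        if compile_old(sql) != compile_new(sql):
            divergent.append(sql)
    return divergent

corpus = [
    "SELECT id FROM t ORDER BY id",
    "SELECT count(*) FROM t",
]
flagged = playback(corpus)
print(flagged)  # only the diverging statements need manual review
```

In the real system this comparison itself runs as a distributed job on the platform, which is what makes verifying millions of historical statements feasible.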


Flighting

Flighting is a runtime verification tool. The most common and natural practice for testing an improved optimizer is to create a test cluster. However, this method wastes a lot of resources because a test cluster is expensive, and it cannot simulate real scenarios because the workload in a test environment differs from the workload in production. To perform verification in a test environment, you either have to create test data, which may be too simple, or collect data from real scenarios, which is a complex process that requires user consent and data desensitization. In addition to risking data breaches, this method may lead to differences between the final test data and the real data. Therefore, in most cases it is impossible to verify the new features and find their problems this way.
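The alternative behind flighting is to carve a small slice out of the production cluster itself and run a sample of real jobs through the new version there, so verification uses genuine workloads rather than synthetic test data. The routing logic below is a minimal sketch under assumed numbers; the 1% fraction and the function names are illustrative, not MaxCompute’s actual parameters.

```python
import random

# Hypothetical sketch of "flighting": reserve a small slice of production
# resources and replay a sample of real jobs on the new engine version
# there, so verification uses genuine workloads instead of synthetic data.

FLIGHT_FRACTION = 0.01  # e.g. 1% of capacity (illustrative number)

def route_job(job_id: int, rng: random.Random) -> str:
    """Decide whether a production job is also replayed in the flight slice."""
    if rng.random() < FLIGHT_FRACTION:
        return "production+flight"   # run normally AND replay on new version
    return "production"              # run on the stable version only

rng = random.Random(42)              # fixed seed for a reproducible sample
routes = [route_job(i, rng) for i in range(10_000)]
flighted = routes.count("production+flight")
print(f"{flighted} of 10000 jobs sampled into the flight slice")
```

Because the flighted copy of a job runs on real data already inside the platform’s security boundary, no data needs to be exported or desensitized for testing.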

Phased Release

MaxCompute provides a phased release feature that releases new versions in a fine-grained way according to task priority and supports instantaneous rollback to minimize risk.
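The idea can be sketched as a routing rule: the new version is first enabled only for low-priority tasks and then gradually for higher tiers, while a single rollback switch instantly sends all traffic back to the stable version. The flag names and the priority convention below are hypothetical, chosen only to illustrate the mechanism.

```python
# Hypothetical sketch of phased release: the new engine version is first
# applied only to low-priority jobs, then gradually to higher tiers; one
# rollback flag instantly routes everything back to the stable version.

ROLLBACK = False           # flip to True to instantly revert all traffic
RELEASED_MAX_PRIORITY = 3  # new version enabled for priorities <= 3

def pick_version(job_priority: int) -> str:
    """Lower number = lower priority in this sketch."""
    if ROLLBACK:
        return "stable"
    if job_priority <= RELEASED_MAX_PRIORITY:
        return "new"
    return "stable"

print(pick_version(1))  # low-priority job tries the new version
print(pick_version(9))  # high-priority job stays on the stable version
```

Raising `RELEASED_MAX_PRIORITY` step by step widens the release, and because the decision is made per job at submission time, rollback takes effect immediately without redeploying anything.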

Automation and Intelligence

With the development of artificial intelligence and big data technologies, services today need to implement not only high availability but also automation. We can consider this requirement from two perspectives: the services themselves, and users’ operations and maintenance. The services themselves need to implement automation first and then intelligence.

The Basis for the Low Cost of MaxCompute

MaxCompute’s low cost makes it more competitive, and this low cost is enabled by technology dividends. To lower the cost of big data tasks, MaxCompute makes improvements in three key areas: computing, storage, and scheduling.

  • In terms of computing, MaxCompute optimizes its computing engine to improve performance and reduce the resources used by each job. Because every job finishes faster or uses fewer resources, capacity is freed up and the platform can host more computing tasks.
  • When it comes to storage, a variety of optimization methods are available, such as optimizing the compression algorithm, using advanced columnar storage, tiering data storage, archiving cold data, and recycling data that is of little value.
  • For scheduling, the main optimization methods to lower the cost include intra-cluster scheduling and global scheduling across clusters.
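The storage point above can be made concrete with a simple tiering rule: classify each table by how recently it was accessed, so cold data can be recompressed or archived and long-untouched data flagged for recycling. The thresholds and tier names below are illustrative assumptions, not MaxCompute’s actual policy.

```python
from datetime import date, timedelta

# Hypothetical sketch of storage-cost tiering: classify data by how
# recently it was read, so cold data can be recompressed or archived
# and long-untouched data flagged for recycling. Thresholds are made up.

def storage_tier(last_access: date, today: date) -> str:
    age = (today - last_access).days
    if age <= 30:
        return "hot"       # keep in standard columnar storage
    if age <= 180:
        return "cold"      # recompress with a higher-ratio codec
    return "archive"       # archive, or recycle if of little value

today = date(2020, 1, 1)
print(storage_tier(today - timedelta(days=7), today))    # hot
print(storage_tier(today - timedelta(days=90), today))   # cold
print(storage_tier(today - timedelta(days=400), today))  # archive
```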

Intra-Cluster Scheduling

Combining these two aspects of scheduling effectively is critical and requires powerful scheduling capabilities. Currently the MaxCompute platform can make more than 4,000 scheduling decisions per second. Each decision places a set of jobs onto the corresponding resources, ensuring that the resources the jobs require are sufficient while no extra resources are wasted.
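A single scheduling decision of this kind can be sketched as an admission check: a job starts only if the cluster’s free resources cover its request, so jobs never run under-provisioned and capacity is not over-reserved. Real schedulers weigh far more dimensions (memory, locality, priority); the greedy CPU-only loop below is a deliberately minimal illustration with made-up job names.

```python
# Hypothetical sketch of an intra-cluster scheduling decision: admit a
# job only if the cluster's free resources cover its request; otherwise
# it waits. Real schedulers consider many more dimensions than CPU.

def schedule(jobs, total_cpu):
    """Greedily admit jobs that fit; return (admitted, waiting, free_cpu)."""
    admitted, waiting = [], []
    free = total_cpu
    for name, cpu in jobs:
        if cpu <= free:
            admitted.append(name)
            free -= cpu
        else:
            waiting.append(name)
    return admitted, waiting, free

jobs = [("etl", 400), ("report", 300), ("ml-train", 500), ("adhoc", 200)]
admitted, waiting, free = schedule(jobs, total_cpu=1000)
print(admitted)  # ['etl', 'report', 'adhoc']
print(waiting)   # ['ml-train']
```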

Two-Tier Quota Group

For scheduling within a single cluster, we mainly explain the two-tier quota group, which is rare among common big data platforms. Before introducing the two-tier quota group, let’s first look at quota groups in general. To put it simply, the various resources in the physical clusters are pooled together, and a quota group is a group of resources drawn from this pool. On the big data platform, jobs run in quota groups, and jobs within the same quota group share all of that group’s resources. The common one-tier quota group policy has many problems. For example, consider a one-tier quota group that has 1,000 CPU cores and is shared by 100 users. At a given moment, one user submits a very large job that uses up all 1,000 CPU cores in the quota group, putting all other users’ jobs on the waiting list.
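A two-tier quota group addresses this by adding a second cap: the first tier limits the group as a whole, and the second tier limits each user’s share inside the group, so one huge job cannot starve everyone else. The cap values and the `grant` function below are a minimal sketch of the idea, not MaxCompute’s actual admission logic.

```python
# Hypothetical sketch of a two-tier quota group: tier 1 caps the whole
# group (e.g. 1,000 CPU cores) and tier 2 caps each user's share inside
# the group, so one huge job cannot monopolize the quota group.

GROUP_CAP = 1000   # tier 1: total cores in the quota group
USER_CAP = 250     # tier 2: max cores any single user may hold

def grant(requested: int, user_used: int, group_used: int) -> int:
    """Cores actually granted to a request under both tiers."""
    user_room = max(0, USER_CAP - user_used)
    group_room = max(0, GROUP_CAP - group_used)
    return min(requested, user_room, group_room)

# One user asks for the entire group's capacity...
big = grant(requested=1000, user_used=0, group_used=0)
print(big)  # 250 -- capped at the per-user tier; 750 cores remain for others
```

Under the one-tier policy described above, the same request would have been granted all 1,000 cores; the second tier is what keeps the group shareable.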

Data Security

When an enterprise uses a big data computing platform, its biggest concern is data security. Judging from Alibaba’s experience, users care most about the following three data security issues: Can others see my data after it is stored on a big data platform? Can the service provider see my data after it is put on the platform? What happens if problems occur on the big data platform that hosts my data?



Alibaba Cloud