Large-Scale Computing with MaxCompute: From Digital Alibaba to Digital Cities
Tony Guan, Senior Staff Engineer at Alibaba Computing Platform, spoke at The Computing Conference 2018 in Hangzhou about the exploration and practice behind MaxCompute, Alibaba’s unified ultra-large-scale data computing platform, covering its computing capabilities, federated computing, intelligence, and enterprise-level serviceability. MaxCompute is evolving rapidly from digital Alibaba to digital enterprises and then to digital cities, making technology more inclusive and helping China become a digital country.
Apsara 2.0 MaxCompute is the distributed computing part of Alibaba’s Apsara system, which was developed nine years ago. Today, MaxCompute can store exabytes of data and process hundreds of petabytes of data daily. In the public cloud, MaxCompute is available in more than 10 countries and regions around the world; in Apsara Stack, MaxCompute runs in more than 100 deployments, including ET City Brain.
Zhou Jingren, vice president of Alibaba Group, said in an interview at the Computing Conference that Alibaba has worked on big data and cloud computing since 2008, originally to support its core e-commerce business. As Alibaba’s business expanded, its big data platform, the predecessor of MaxCompute, evolved rapidly. The computing platform products first succeeded inside Alibaba, where they play a vital role across the entire business. After intensive and extensive internal verification, Alibaba wanted the same technologies to benefit the wider world, so it made the products available on Alibaba Cloud to serve enterprise users in various industries. According to Zhou, the Alibaba Cloud computing platform has the longest history and the deepest technical experience, at least in China, and MaxCompute is an intelligent big data computing platform that has passed a wide range of business tests and provides true enterprise-class serviceability.
Min Wanli, Alibaba Cloud’s chief machine intelligence scientist, said in a speech in the main forum at the Computing Conference, “MaxCompute is an extremely important part of the ET brain blood supply system. It is our biggest treasure. Without MaxCompute, I may not be able to tell you any successful case here today.” ET City Brain monitors signal lights at urban intersections automatically. It is not magic. It is MaxCompute that enables the large-scale distributed computing behind it. The greater the data volume and the more complicated the scenario, the more clearly large-scale computing shows its strength.
The following is a transcript of Tony Guan’s speech.
From Digital Alibaba to Digital Cities
Thank you, everyone. I am Tony Guan. Hangzhou City Brain is a brand-new platform that supports Alibaba’s advancement. We first set out to digitize Alibaba. Later, we wanted to digitize enterprises. Now we are digitizing a city.
Let’s look through a magnifying glass at what happens when we digitize a city. We just mentioned 1,300 intersections. Digitizing them involves 4,500 cameras. Each camera generates 24 frames per second, and each frame is a high-definition picture of 1920x1024 pixels at 24-bit color depth. Each uncompressed frame is about 50 Mb, or roughly 6 MB. The City Brain identifies vehicles, license plates, pedestrians, and violations (such as driving on lane lines). It then analyzes the frames to determine vehicle speeds and intersection congestion, and computes a likely congestion index from the frames collected between intersections. City digitalization places extremely high demands on data and computing. To compute capably, quickly, and accurately, we need a powerful computing platform.
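To make the scale concrete, the camera figures quoted above can be multiplied out. This is a back-of-the-envelope sketch only: it assumes raw, uncompressed frames, every camera streaming around the clock, and decimal units (1 MB = 10^6 bytes). None of those assumptions comes from the speech, and real deployments would compress video heavily.

```python
# Back-of-the-envelope data volume from the figures quoted in the speech.
# Assumptions (not from the speech): raw uncompressed frames, all cameras
# streaming 24 hours a day, decimal units.
WIDTH, HEIGHT = 1920, 1024   # pixels per frame, as quoted
BITS_PER_PIXEL = 24          # 24-bit color depth
FPS = 24                     # frames per second per camera
CAMERAS = 4500
SECONDS_PER_DAY = 86_400

bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
bytes_per_frame = bits_per_frame // 8
bytes_per_second_all = bytes_per_frame * FPS * CAMERAS
bytes_per_day_all = bytes_per_second_all * SECONDS_PER_DAY

print(f"per frame:   {bits_per_frame / 1e6:.1f} Mb = {bytes_per_frame / 1e6:.1f} MB")
print(f"all cameras: {bytes_per_second_all / 1e9:.0f} GB/s")
print(f"per day:     {bytes_per_day_all / 1e15:.1f} PB")
```

Under these assumptions a single frame is about 47 megabits (roughly 5.9 MB), the 4,500 cameras together produce about 637 GB of raw pixels per second, and a full day comes to about 55 PB.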
Apsara 2.0 MaxCompute is the distributed computing part of Alibaba’s Apsara system, which was developed nine years ago and consists of distributed storage, distributed scheduling, and distributed computing. Today, MaxCompute can store exabytes of data and process hundreds of petabytes of data daily. In the public cloud, MaxCompute is available in more than 10 countries and regions around the world; in Apsara Stack, MaxCompute runs in more than 100 deployments, including ET City Brain, amounting to 100,000 servers. MaxCompute provides the computing capabilities all of them require, including those of City Brain.
At the bottom layer of its system architecture, MaxCompute runs on heterogeneous computing clusters, including CPU clusters, GPU clusters, FPGA clusters, and, in the future, intelligent hardware clusters. The clusters are located in different places and are linked by unified metadata management and unified scheduling. From the user’s perspective, these 100,000 servers look like a single computer. MaxCompute provides a range of computing capabilities, including batch computing, stream computing, in-memory computing, machine learning, and iterative computing. The computing platform underpins the computing capabilities of the Alibaba economy and Alibaba Cloud.
What Can MaxCompute Do?
Today, I’d like to illustrate the following points about MaxCompute:
Firstly, computing capabilities are the core indicator for a computing platform.
In the 2015 GraySort competition, we sorted 100 TB of data in 377 seconds, breaking the 1,406-second record set by Apache Spark and winning the world championship. In 2016, we won CloudSort as well, which shows that MaxCompute is not only fast but also cost-effective. In 2017, MaxCompute took on all 30 queries of the TPCx-BB (BigBench) benchmark at the 100 TB scale, becoming the world’s first computing engine to pass the test. In 2018, we doubled our performance at the same 100 TB scale. In addition, comparative analysis shows that at the smaller 10 TB scale our performance is more than three times that of comparable open source products. This is a continuous upgrade of computing capabilities. It is also the computing power required when the digital flood peaks. Computing becomes more cost-effective, and intelligence becomes more inclusive.
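The two GraySort times quoted above translate directly into throughput and speedup. The short sketch below only does that arithmetic; the function names are illustrative and decimal terabytes are assumed.

```python
def throughput_tb_per_s(data_tb: float, seconds: float) -> float:
    """Average sort throughput in TB per second."""
    return data_tb / seconds


def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new run is than the old one."""
    return old_seconds / new_seconds


DATA_TB = 100            # GraySort input size quoted in the speech
NEW_RECORD_S = 377       # 2015 result quoted in the speech
PREVIOUS_RECORD_S = 1406  # earlier Apache Spark record quoted in the speech

print(f"throughput: {throughput_tb_per_s(DATA_TB, NEW_RECORD_S):.3f} TB/s")
print(f"speedup:    {speedup(PREVIOUS_RECORD_S, NEW_RECORD_S):.2f}x")
```

That is roughly 0.27 TB sorted per second, or about 3.7 times faster than the previous record.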
Secondly, computing push-down is more efficient than data move-up.
Data migrated to the cloud is generally stored in different systems. Online service data is typically stored in a database to support front-end services, while semi-structured logs and unstructured audio, video, and images are stored in a data lake. The front-end database provides abundant indexes, which ensures efficient computing. The back-end provides super-large-scale storage. The big data system in the middle stores structured data in columnar form for super-large-scale data computing.
This is a challenge for us because users expect to compute over all of this data together to achieve an optimal result.
To address this challenge, two solutions are available. One is the so-called data move-up, which uploads all data into a single system and starts computing only after the upload has been synchronized. This solution has three problems: data redundancy, because the same data is kept in one or two extra copies; synchronization delay, during which the data cannot be computed on; and, as a result, reduced real-time performance.
Therefore, we propose the other solution, federated computing, which pushes computing down to where the data lives and is more efficient than moving data up. What is federated computing? In federated computing, a single job in the big data system links all the systems together, without data synchronization. For example, a job can join a table in the database system with data in the big data system. If the join involves filtering or aggregation, part of the computation is pushed down to the database system and executed there. Federated computing spreads one job across multiple systems, places each part according to the optimal decision at that time, and connects the data at this level.
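The difference between data move-up and computing push-down can be sketched in a few lines. This is not MaxCompute code: an in-memory SQLite database stands in for the online database, a Python dictionary stands in for a table in the big data system, and all table and function names are hypothetical. The point is only that pushing the filter and aggregation into the database returns the same join result while moving fewer rows.

```python
import sqlite3


def make_orders_db() -> sqlite3.Connection:
    """Stand-in for the online database system (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "hangzhou", 10.0), (1, "hangzhou", 5.0), (2, "hangzhou", 7.5),
         (3, "beijing", 20.0), (4, "beijing", 1.0)],
    )
    return conn


# Stand-in for a table in the big data system: click counts per user.
CLICKS = {1: 120, 2: 30, 3: 75}


def move_up(conn, region):
    """Copy every row out of the database, then filter, aggregate, and join."""
    rows = conn.execute("SELECT user_id, region, amount FROM orders").fetchall()
    totals = {}
    for user_id, row_region, amount in rows:
        if row_region == region:
            totals[user_id] = totals.get(user_id, 0.0) + amount
    joined = {u: (t, CLICKS[u]) for u, t in totals.items() if u in CLICKS}
    return joined, len(rows)  # result, rows transferred


def push_down(conn, region):
    """Let the database filter and aggregate; only the summary is transferred."""
    rows = conn.execute(
        "SELECT user_id, SUM(amount) FROM orders WHERE region = ? GROUP BY user_id",
        (region,),
    ).fetchall()
    joined = {u: (t, CLICKS[u]) for u, t in rows if u in CLICKS}
    return joined, len(rows)


conn = make_orders_db()
up_result, up_rows = move_up(conn, "hangzhou")
down_result, down_rows = push_down(conn, "hangzhou")
print(up_result == down_result, up_rows, down_rows)
```

With this toy data both approaches produce the same joined result, but move-up transfers all five order rows while push-down transfers only the two aggregated rows. A real optimizer would also weigh index availability and the load on the source system, which this sketch ignores.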
Thirdly, Auto Data Warehouse implements big data autopilot.
This new feature may be available on the public cloud this year. Alibaba faced a huge practical challenge five years ago, when data volume exploded and multiplied every year. Currently, each Alibaba employee has about 100 jobs on average, and five or six million jobs run in the system every day. If these jobs were distributed evenly among the data platform staff, each person would be responsible for hundreds of thousands of tables and tens of thousands of jobs. The relationships among the data and among the jobs are too complicated for a human brain to grasp.
What should we do? At that time it was difficult to tell whether any data was redundant, whether any computing was reusable, and what would happen if a job failed or some data was incorrect. To resolve these problems, we developed a system five years ago. We started with the most basic data, built a data map, identified data relationships based on data lineage, analyzed which data was hot and which was cold, and optimized the data automatically. Finally, when a new table is generated, the system automatically identifies its associations. We integrated this system into the data autopilot system, Auto Data Warehouse. In practice within Alibaba, this system optimizes computation by 35%, reduces storage by 20% through data deduplication, and increases computing efficiency by more than 75% through resource planning. This system has been very successful in Alibaba and will be available on the public cloud this year.
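Two of the questions raised above, what is affected when a table is wrong and which tables are redundant, become simple graph queries once lineage is recorded. The sketch below is a toy illustration under assumed names, not the Auto Data Warehouse implementation: lineage is a hypothetical dictionary mapping each derived table to its inputs and a transformation label, and two tables count as redundant when both match.

```python
from collections import defaultdict, deque

# Hypothetical lineage: derived table -> (input tables, transformation label).
LINEAGE = {
    "dwd_orders": (("ods_orders",), "clean"),
    "dwd_orders_copy": (("ods_orders",), "clean"),
    "dws_sales_daily": (("dwd_orders",), "sum_by_day"),
    "ads_sales_report": (("dws_sales_daily", "dim_region"), "join_region"),
}


def downstream(table: str) -> list:
    """All tables that directly or indirectly depend on `table`."""
    children = defaultdict(list)
    for derived, (inputs, _) in LINEAGE.items():
        for source in inputs:
            children[source].append(derived)
    seen, queue = set(), deque([table])
    while queue:  # breadth-first walk over the lineage graph
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


def redundant_groups() -> list:
    """Groups of tables built from the same inputs by the same transformation."""
    by_recipe = defaultdict(list)
    for derived, recipe in LINEAGE.items():
        by_recipe[recipe].append(derived)
    return [sorted(names) for names in by_recipe.values() if len(names) > 1]


print(downstream("ods_orders"))
print(redundant_groups())
```

Here an error in `ods_orders` would affect all four derived tables, and `dwd_orders_copy` is flagged as a duplicate of `dwd_orders`. At the scale described in the speech, such queries would have to run over lineage extracted automatically from job definitions rather than a hand-written dictionary.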
Fourthly, MaxCompute is a full suite of enterprise services, not just a single engine.
Let’s go back to Hangzhou City Brain. This platform has evolved from a basic platform into a data support system for Hangzhou. A system failure is likely to affect the national economy and people’s livelihood. In addition to computing capabilities, the system therefore needs high stability, disaster recovery, and resilience, so that it recovers quickly when, for example, traffic is congested by a large-scale flow of people. The system sends warnings and recovers automatically in unexpected situations, such as when networks break down for physical reasons like Typhoon Mangkhut. All of these capabilities can be summed up in one word, “enterprization”. MaxCompute is not just an engine but a comprehensive platform that provides a full suite of enterprise services.
In addition to computing, MaxCompute provides an account system and a project management system. The account system may seem simple, but account separation is critical when a platform is shared among tens of thousands of people in an enterprise. The data security system sets attributes and labels for data, for example, high priority, low priority, high confidentiality, low confidentiality, high privacy, and low privacy. System security governs how data access is granted to systems and users. The monitoring system triggers warnings when a fault is likely to occur, before it actually happens, so that users can take action in advance.
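Label-based data security of the kind described above can be illustrated with a simple "no read up" rule: a user may read a column only if the user's clearance is at least as high as the column's confidentiality label. The sketch below is a generic illustration; the two-level scale, the class names, and the rule itself are assumptions for the example, not MaxCompute's actual security model or API.

```python
from dataclasses import dataclass

# Assumed two-level scale: a higher number means more sensitive.
LEVELS = {"low": 1, "high": 2}


@dataclass(frozen=True)
class Column:
    name: str
    confidentiality: str  # "low" or "high"


@dataclass(frozen=True)
class User:
    name: str
    clearance: str  # "low" or "high"


def can_read(user: User, column: Column) -> bool:
    """A user may read a column only at or below their clearance level."""
    return LEVELS[user.clearance] >= LEVELS[column.confidentiality]


def readable_columns(user: User, columns: list) -> list:
    """Names of the columns this user is allowed to read."""
    return [c.name for c in columns if can_read(user, c)]


COLUMNS = [Column("intersection_id", "low"), Column("license_plate", "high")]
analyst = User("analyst", "low")
auditor = User("auditor", "high")

print(readable_columns(analyst, COLUMNS))
print(readable_columns(auditor, COLUMNS))
```

In this example the low-clearance analyst sees only `intersection_id`, while the high-clearance auditor sees both columns.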
We launched DQC this year, which helps identify and rectify errors as data becomes more complicated. For example, DQC checks and verifies incorrect jobs at several levels and ensures data correctness through a series of rules. We also launched a multi-cluster disaster recovery system that has been implemented in some Apsara Stack projects. This system is compatible with the financial disaster recovery system and complies with China’s first generation of financial laws and regulations.
Development efficiency is as important as computing capabilities, federated computing, and intelligence. In addition to more than 10,000 developers within Alibaba, tens of thousands of enterprise users are developing on this Alibaba Cloud platform, affecting hundreds of thousands of people, so development efficiency is crucial. This year we upgraded the DataWorks development and debugging platform and developed a data integration system and a pipeline management system. These systems can connect to machine learning, data analysis, and BI platforms. We also launched App Studio. Beyond the engine itself, we have integrated all enterprise-class computing services.
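The idea of ensuring data correctness "through a series of rules", mentioned for DQC above, can be sketched as a set of small predicates evaluated against a table. This is a generic illustration only: the rule names, thresholds, and column names are hypothetical and do not reflect DQC's actual rule set or interface.

```python
def not_null(rows, column):
    """Rule: every row has a value for `column`."""
    return all(row.get(column) is not None for row in rows)


def in_range(rows, column, low, high):
    """Rule: every non-null value of `column` lies in [low, high]."""
    return all(low <= row[column] <= high
               for row in rows if row.get(column) is not None)


def row_count_stable(today_count, yesterday_count, max_change=0.5):
    """Rule: the row count changed by at most `max_change` (a fraction)."""
    if yesterday_count == 0:
        return today_count == 0
    return abs(today_count - yesterday_count) / yesterday_count <= max_change


def failed_rules(rows, yesterday_count):
    """Evaluate every rule and return the names of the ones that failed."""
    results = {
        "plate not null": not_null(rows, "plate"),
        "speed in 0..200 km/h": in_range(rows, "speed_kmh", 0, 200),
        "row count stable": row_count_stable(len(rows), yesterday_count),
    }
    return [name for name, passed in results.items() if not passed]


GOOD = [{"plate": "A1", "speed_kmh": 42}, {"plate": "B2", "speed_kmh": 61}]
BAD = [{"plate": None, "speed_kmh": 42}, {"plate": "B2", "speed_kmh": 900}]

print(failed_rules(GOOD, yesterday_count=2))
print(failed_rules(BAD, yesterday_count=10))
```

The first call reports no failures, while the second reports all three rules. A job that produced the second table could then be blocked or flagged before downstream jobs consume its output.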
We have integrated computing capabilities, federated computing, intelligence, and enterprise-class serviceability into a complete big data platform. We will continue to advance the platform and its products through technology, giving enterprises and society sufficient computing capabilities and the ability to evolve rapidly and continuously, and helping China become a digital country.