DAG 2.0: The Adaptive Execution Engine That Supported 150 Million Distributed Jobs in the 2020 Double 11

Preface

This Double 11, the largest number of jobs were run with the least manual intervention in the history of big promotional events. The technical support for this Double 11 featured dynamic adjustment, intelligent data orchestration, and the bubble execution mode. The dynamic adjustment feature reduced the number of scheduled compute nodes by 1 billion on November 11 alone. The intelligent data orchestration feature expedited the execution of a single high-priority job on an important pipeline by dozens of hours. The bubble execution mode improved the performance of millions of jobs by more than 30%.

Challenge and Background

Double 11 is undoubtedly the best opportunity to verify the stability and scalability of the distributed scheduling and execution frameworks of Alibaba Cloud core computing platforms. On a regular day of this year, more than 10 million distributed jobs are scheduled and executed on the computing platforms. During Double 11 of this year (from November 1st to November 11th), the total number of jobs exceeded 150 million, the peak number of jobs per day exceeded 16 million, and the amount of data processed per day exceeded 1.7 EB. As Alibaba Group CTO Lusu said, “The ‘abnormality’ of Double 11 eventually becomes the ‘normality’ of regular days.” Take the number of distributed jobs scheduled and executed on computing platforms per day during Double 11 as an example. This number increases by more than 50% compared with last year, and the peak of the previous Double 11 eventually becomes commonplace in the coming year.

Number of distributed jobs running on computing platforms per day

Adaptive Dynamic Execution: A “Different” Double 11

The effectiveness of an execution plan for distributed jobs depends on how an engine optimizer predicts the characteristics of data. However, online data is complex and dynamic, and different processing strategies are required for different scenarios. These factors significantly affect the prediction accuracy of an optimizer.

Adaptive Shuffle: Intelligent Data Orchestration to Avoid Data Skew

Data skew has always been a pain point of distributed systems. All the components across the data processing link, including storage devices, optimizers, and compute engines, have adopted multiple methods to avoid data skew. However, data skew is almost impossible to eliminate due to the characteristics of data distribution. During Double 11, data skew occurs more frequently. For example, the sales volume of hot items is skyrocketing, which causes data skew. Most transactions are completed in the early morning of November 11th, which also results in data skew.

Prolonged job execution caused by data skew
Intelligent data orchestration based on adaptive shuffle to avoid data skew
Distribution of the 130,000 jobs for which adaptive shuffle eliminated data skew issues

Adaptive Parallelism Adjustment: Dynamic Data Distribution at the Partition Level to Optimize Resource Usage

In the industry, the most common method to dynamically adjust parallelism for distributed jobs is based on the application master (AM). To estimate parallelism, the AM obtains the total amount of output data from upstream nodes and divides the total data amount by the data amount that each compute node is expected to process. If the number of compute nodes in the original execution plan exceeds the number of nodes required to process the actual output data, the AM performs a scale-down of the parallelism for dynamic adjustment. The specific scale-down ratio is based on the original parallelism and adjusted parallelism. The AM also uses a single compute node to merge adjacent data partitions. This is a simple method and is suitable only when data is evenly distributed. However, it is almost impossible to achieve an even data distribution for production jobs, especially during Double 11. If this method is used to process jobs such as these, serious data skew may occur after the data is merged.

Simple dynamic parallelism adjustment vs adaptive dynamic parallelism adjustment

Conditional Join: The Optimal Choice for Real-Time Join Operations

For traditional big data jobs, the execution plan is typically determined by the optimizer before the jobs are submitted and cannot be adjusted when the jobs are in progress. This mechanism requires superior prejudgment from the optimizer, a capability that is often unfulfilled in the actual execution of production jobs.

Different join algorithms in distributed SQL
Implementation of conditional joins on DAG 2.0
Dynamic execution plan adjustment by using nested conditional joins

Next-Generation Execution Framework: Higher Flexibility and Efficiency

This year the computing platforms upgraded the quasi-real-time execution framework to 2.0 based on the DAG 2.0 execution engine. In the framework 2.0, resources and jobs are managed separately and the job-managing system component supports horizontal scaling, contributing to the successful handling of the data generated on Double 11. In addition, the DAG 2.0 execution engine tunes performance based on the implementation logic of the core state machine that is used for event processing, so as to optimize execution of jobs on various scales. This optimization enhances the performance of more than tens of millions of jobs that run every day.

System Metric Optimization

As the quasi-real-time execution framework 2.0 was widely used and continuously optimized, the overheads caused by job execution on offline computing nodes in Double 11 2020 decreased by 2.8 times compared with last year. The proportion of the job running time that generates overheads to the total running time of computing nodes was also reduced by several times. The reduction in scheduling overheads saved the costs of thousands of physical machines for Alibaba Group. The total number of quasi-real-time jobs that were running on a single day exceeded 10 million, up by 65% compared with last year, whereas the overheads decreased by 1.9 times. The proportion of sub-second jobs increased by 2.4 times. The probability of quasi-real-time jobs returning to the batch mode reduced by 2.9 times.

Centralized Management of High-Load Clusters

The stress testing conducted before Double 11 2020 found that in some large clusters, the QPS of quasi-real-time services reached the upper limit due to the soaring number of jobs. This caused a large number of jobs to return to the batch mode. If the job parallelism on a Fuxi master reaches the upper limit, these batch jobs need to queue for a long time before being processed. To address this issue, the quasi-real-time execution framework 2.0 allows the job-managing system component to scale horizontally, so that more jobs can be executed in quasi-real-time mode, which significantly improves job performance and shortens queuing time. This enables the system to efficiently cope with the sharp increase in data volume on Double 11 without requiring additional cluster resources.

Quasi-real-time execution framework 2.0
Prevented impact of peak traffic on Apsara Name Service and Distributed Lock Synchronization System

Support for the PAI Elastic-Inference Engine

In addition to MaxCompute jobs, DAG 2.0 provides native support for TensorFlow and PyTorch jobs running on the PAI platform. Compared with DAG 1.0, DAG 2.0 provides accurate semantic descriptions of jobs and offers greatly improved dynamic capabilities and fault tolerance. Before Double 11 2020, the PAI team worked together with the scheduling and execution framework team to develop the PAI elastic-inference engine that is implemented based on resource scaling. This brand-new engine significantly improves the overall performance of inference jobs on the PAI platform.

Trend of execution duration of video feature extraction tasks performed by an algorithm recommendation team

Support for the Bubble Execution Mode

DAG 2.0 uses the all-in-one execution mode to unify the batch execution mode and quasi-real-time mode based on the flexible DAG hierarchical model. This DAG model lays the foundation for the bubble execution mode that is developed during FY21 of Alibaba. In this mode, a variety of jobs can be executed with balanced resource consumption and superior performance.

Bubble execution mode based on the DAG model
Latency and resource consumption comparison between the batch execution mode and bubble execution mode

Outlook

DAG 2.0 features an upgraded architecture, and the execution engine is developed with the goal to become the foundation for the long-term development of computing platforms. DAG 2.0 also aims to support the combination of upper-layer computing engines and distributed scheduling to implement various innovations and create a whole new computing ecosystem. The architecture upgrade was completed during Double 11 2019, marking a big step forward. In 2020, a number of new features were developed based on the new DAG 2.0 architecture and have been implemented, including adaptive dynamic execution and new computing modes. These new features have helped Alibaba Group once again withstand the test of the Double 11 Global Shopping Festival.

Original Source:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store