DAS Was Polished and Enhanced to Provide Daily Support for Double 11

Dec 29, 2020

The 2020 Double 11 Global Shopping Festival has come and gone, but the exploration of technology never stops. Double 11 is not only a carnival for buyers but also a major test of the technological level and innovation practices of the Alibaba Cloud Database Technical Team.

Introduction

The Database Autonomy Service (DAS) has kept innovating to support the Double 11 Global Shopping Festival, and it has transformed itself along the way. Starting from the first diagnosis tools built to assist Database Administrators (DBAs), and continuing with the Self-Driving Platform concept proposed in 2017, Alibaba Cloud has incubated and improved the autonomous capability of the database. In 2018, autonomous capabilities such as automatic SQL optimization, automatic storage optimization, and automatic exception repair were gradually rolled out to database instances across Alibaba Group’s entire network. In November 2019, Alibaba Cloud integrated HDM, CloudDBA, and these autonomous capabilities into DAS to serve customers better. The technologies developed in-house were consolidated on the cloud platform to serve both Alibaba and its customers, which marked another successful transformation.

A successful transformation also means taking on responsibility. During the 2020 Double 11 Global Shopping Festival, DAS supported Alibaba Group and cloud customers with more stable and comprehensive autonomous capabilities. With database autonomy covering the entire network, Alibaba Group completed a total of 49 million automatic slow SQL optimizations and 4.6 PB of automatic storage optimization. Automatic exception repair covered hundreds of thousands of instances in e-commerce and other scenarios, with a coverage rate exceeding 90%. DAS delivered the “1–5–10” exception self-repair capability, namely 1-minute detection, 5-minute location, and 10-minute recovery. This capability guaranteed database stability during Double 11 and saved a great deal of labor. At the same time, DAS provided full-lifecycle support for hundreds of thousands of customers. Before the shopping festival, it covered health inspection, risk identification, troubleshooting, and SQL optimization. During Double 11, it provided automatic SQL traffic limiting, automatic storage scaling, and dashboard management. After the festival, it offered on-site storage and a promotion summary. Throughout the process, millions of inspection reports were generated and tens of billions of exception checks were performed every day. As a result, nearly 10,000 abnormal instances were found and automatically fixed and optimized, which improved customers’ experience during the Double 11 Global Shopping Festival.

DAS

The core concept of DAS comes from Alibaba’s view of the technological trends in database intelligence, shaped by its experience of the Double 11 Global Shopping Festivals. With some of the most experienced DBAs in the industry and plenty of performance diagnostic data, Alibaba focuses on how to combine its database knowledge, experience, and big data with machine intelligence technologies. The goal is to equip the database with an autopilot engine that maximizes database stability and efficiency at the lowest possible cost. In the cloud-native era, a high degree of database autonomy is expected to free application developers from database O&M entirely so that they can focus on business innovation. Industry and academia broadly agree on the direction of database autonomy. For example, Carnegie Mellon University (CMU) created Peloton, a database autonomy project, to achieve full autonomy under mixed loads. Following the self-driving database concept, and driven by data, expert experience, and machine learning, DAS gives databases self-perception, self-recovery, self-optimization, and self-security capabilities, much like giving vehicles the ability to drive themselves.

The preceding figure shows the core concepts of DAS that are consistently followed and implemented throughout its design, R&D, and implementation:

Data-Driven: Massive real-time data, such as performance indicators, workload SQL request logs, and O&M change logs, is collected to build detection capabilities that ensure real-time situational awareness and exception detection.
Automatic Decision-Making: The deep integration of machine learning and expert experience in the database field gives DAS automatic decision-making competence in different business scenarios.
Automatic Execution: Based on the decisions made by the autonomous center, tasks are automatically orchestrated and executed.

Finally, these parts are coordinated in a closed loop that delivers the autonomous capabilities: exception detection, overall decision-making based on root cause analysis, execution of repair and optimization decisions, continuous tracking and evaluation, performance feedback, and rollback. The entire loop runs the autonomous scenario without manual intervention. In addition, DAS supports continuous self-learning through automatic exception tagging, case systems, exception simulation, and quantitative feedback evaluation. By accumulating cases from a wide range of online business scenarios, DAS accelerates its own evolution and continuously improves the effectiveness of autonomy.
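The closed loop described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the function `autonomy_cycle` and its five callbacks are hypothetical names, not part of DAS, and a real system would run these stages asynchronously and at scale.

```python
def autonomy_cycle(detect, diagnose, execute, evaluate, rollback):
    """Run one pass of a detect -> diagnose -> execute -> evaluate loop.

    All five arguments are hypothetical callbacks supplied by the caller:
      detect()          returns an anomaly description, or None if healthy
      diagnose(anomaly) returns the action chosen by root cause analysis
      execute(action)   applies the repair or optimization
      evaluate(action)  returns True if tracking shows an improvement
      rollback(action)  undoes the action when evaluation fails
    """
    anomaly = detect()
    if anomaly is None:
        return "healthy"
    action = diagnose(anomaly)
    execute(action)
    if evaluate(action):
        return "fixed"
    # The evaluation did not confirm a benefit, so undo the change.
    rollback(action)
    return "rolled_back"
```

The point of the sketch is the ordering: nothing is executed without a diagnosis, and every executed action is either confirmed by evaluation or rolled back.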

Based on the preceding concepts, DAS has six core autonomous features: 7x24 real-time exception detection, fault self-recovery, automatic optimization, intelligent parameter tuning, automatic elasticity, and intelligent stress testing. The specific performance of these features is introduced below, combined with real cases from the 2020 Double 11 Global Shopping Festival.

DAS Core Autonomous Technologies

7x24 Real-Time Exception Detection

Based on machine learning algorithms, DAS performs real-time exception detection on database workloads and can detect exceptions earlier than the traditional threshold-based alert method. Massive amounts of online and offline data, such as hundreds of database performance metrics and workload SQL request logs, are processed and stored through the metric collection channel. Machine learning and prediction algorithms continuously train models for business database instances. Combined with a stream computing framework for big data analysis, these models provide real-time prediction, exception detection, and analysis. Compared with traditional rule-based and threshold-based methods, real-time exception detection has the following advantages:

Broader detection scope, covering performance indicators, SQL, logs, and locks
Quasi-real-time detection, which is much faster than traditional exception detection
AI-based, exception-driven detection rather than fault-driven detection
Periodicity recognition, self-adaptation to service features, and predictive capability

*Table: Time Series Characteristics in Common Workload Scenarios*

As shown above, the exception detection service accurately and automatically recognizes time series features that are common in workload scenarios, such as spikes, seasonality, trends, and mean shifts, and it can identify several of these characteristics in the same series. When an exception is recognized, diagnosis and root cause analysis are triggered, followed by autonomous recovery and optimization.
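To make the difference between a spike and a mean shift concrete, here is a minimal Python sketch. It is not the DAS algorithm: it swaps in a simple rolling median and median absolute deviation (MAD) score in place of the trained models the article describes. The function names, the window size, and the thresholds are all assumptions chosen for illustration.

```python
from statistics import median


def robust_scores(values, window=20, min_scale=1.0):
    """Score each point against the median/MAD of the preceding window."""
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 5:
            scores.append(0.0)  # not enough history to judge
            continue
        center = median(history)
        mad = median([abs(x - center) for x in history])
        # min_scale keeps a perfectly flat history from inflating the score.
        scores.append(0.6745 * (value - center) / max(mad, min_scale))
    return scores


def detect(values, window=20, z_limit=6.0, sustain=5, min_scale=1.0):
    """Return (start_index, kind) events; kind is 'spike' or 'mean_shift'.

    A short run of outliers is treated as a spike, while a run of at least
    `sustain` consecutive outliers is treated as a shift in the mean level.
    """
    flagged = [abs(z) > z_limit for z in robust_scores(values, window, min_scale)]
    events, i = [], 0
    while i < len(flagged):
        if not flagged[i]:
            i += 1
            continue
        j = i
        while j < len(flagged) and flagged[j]:
            j += 1
        events.append((i, "mean_shift" if j - i >= sustain else "spike"))
        i = j
    return events
```

Because each point is compared with its own recent history rather than a fixed limit, the same code adapts to instances with very different baselines, which is the property a static threshold lacks. Seasonality and trend recognition are outside the scope of this sketch.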

Fault Self-Recovery

Through 7x24 real-time exception detection, DAS recognizes instance exceptions in real time and automatically performs root cause analysis. It then carries out loss prevention and recovery operations so that the database recovers automatically, which reduces the impact on the business. Automatic SQL traffic limiting is a typical autonomous scenario. The following real case is from the 2020 Double 11 Global Shopping Festival:

*Figure: Automatic SQL Traffic Limiting Case*

As shown above, the number of active sessions and the CPU usage rate started to surge at 12:31 on November 5, 2020. At 12:33, the DAS exception detection center classified the surge as a database exception rather than a transient spike and triggered root cause analysis for automatic SQL traffic limiting, which was completed at 12:34. The analysis found two problematic SQL statements, and automatic traffic limiting was initiated immediately, which reduced the number of active sessions. With the problematic SQL throttled, active sessions rapidly recovered, and the CPU usage rate also returned to normal. The entire process delivers the “1–5–10” exception self-repair capability, namely 1-minute detection, 5-minute location, and 10-minute recovery.
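The idea behind SQL traffic limiting can be sketched in a few lines of Python. This is an assumption-laden illustration, not the DAS mechanism: the article does not say how DAS identifies or throttles statements, so the sketch assumes statements are grouped by a normalized template and capped by concurrency. `sql_template` and `SqlThrottle` are hypothetical names.

```python
import re
from collections import defaultdict


def sql_template(sql):
    """Collapse literals so statements that differ only in values share a key."""
    sql = re.sub(r"'[^']*'", "?", sql)      # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)      # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


class SqlThrottle:
    """Cap the number of concurrent executions of selected SQL templates."""

    def __init__(self):
        self.limits = {}                    # template -> max concurrent executions
        self.running = defaultdict(int)     # template -> executions in flight

    def limit(self, sql, max_concurrency):
        self.limits[sql_template(sql)] = max_concurrency

    def try_acquire(self, sql):
        key = sql_template(sql)
        cap = self.limits.get(key)
        if cap is not None and self.running[key] >= cap:
            return False                    # reject: the template is at its cap
        self.running[key] += 1
        return True

    def release(self, sql):
        key = sql_template(sql)
        if self.running[key] > 0:
            self.running[key] -= 1
```

Limiting by template rather than by exact text matters because a problematic query usually arrives with different parameter values each time. Statements without a limit pass through untouched, so healthy traffic is not affected.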

External SQL Automatic Optimization

DAS continuously reviews and optimizes the SQL in the database based on overall workload and real business scenarios, just like a tireless professional DBA guarding the database.

*Figure: SQL Optimization Diagnosis in Automatic SQL Optimization*

Experience shows that about 80% of database problems can be solved by SQL optimization. However, SQL optimization has always been a complicated process that requires abundant database knowledge and expert experience, and because SQL workloads change constantly, it remains a time-consuming and heavy task. All of this makes SQL optimization a high-threshold, high-cost task that requires professional support. The SQL diagnosis capability of DAS differs from traditional methods in several technical respects:

It uses an external, cost-based model to produce index recommendations, statement rewrite recommendations, and performance bottleneck identification. This avoids the problems of traditional rule-based approaches, such as the limits of fixed rules, rigid mechanical matching, unguaranteed recommendation quality, and unquantified performance improvement.
It is backed by a formal feature library of test cases, automatic feedback extraction from online cases, and the diversified application scenarios at Alibaba, which together provide sufficient coverage.
It takes workload features, such as SQL execution frequency and read-write ratio, into account to avoid the one-sided drawbacks of local optimization and to optimize for the overall workload.
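The last point, optimizing for the whole workload rather than a single statement, can be illustrated with a small Python sketch. The cost figures, the `Statement` fields, and the `net_benefit` function are hypothetical; in practice the per-execution costs would come from a cost model, which this sketch does not implement.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    template: str
    executions_per_hour: float
    cost_without_index: float   # estimated cost units per execution today
    cost_with_index: float      # estimated cost if the candidate index exists


def net_benefit(statements):
    """Sum the frequency-weighted cost change of a candidate index.

    Reads that use the index contribute a positive saving. Writes that must
    maintain the index contribute a negative one. A positive total means the
    index pays off for the workload as a whole.
    """
    return sum(
        s.executions_per_hour * (s.cost_without_index - s.cost_with_index)
        for s in statements
    )


# Hypothetical workload: one slow but rare query, one cheap but hot insert.
workload = [
    Statement("select ... where buyer_id = ?", 10, 520.0, 20.0),
    Statement("insert into orders ...", 20000, 1.0, 1.5),
]
```

With these made-up numbers the index saves 5,000 cost units per hour on the query but adds 10,000 on the inserts, so the net effect is negative. Looking at the slow query alone would recommend the index, while the workload view would not.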

The following is a real case of automatic SQL optimization during Double 11. On November 7, DAS detected a load exception caused by slow SQL and automatically triggered the SQL optimization closed loop. After the optimization was applied, DAS tracked its effect continuously for 24 hours and then completed the benefit evaluation. The effect turned out to be significant. The average response time (RT) and scanned rows before and after the optimization are shown in the following figures:

*Figure: Average RT and Scanned Rows before Automatic SQL Optimization*

*Figure: Average RT and Scanned Rows after Automatic SQL Optimization*

According to statistics, before the optimization the SQL scanned an average of 148,889.198 rows per second, and the average RT was 505.561 milliseconds. After the optimization, the average scanned rows fell to 12.132, about one ten-thousandth of the previous value (roughly a 12,000-fold reduction). The average RT dropped to 0.471 milliseconds, about one-thousandth of the previous value.

Automatic Elasticity

Cloud databases offer users a choice of compute specifications and storage capacities, so workloads can be scaled elastically when the users’ business scale changes. For cloud-native applications, however, the database is expected to automatically select the most suitable specification as the business workload changes and to provide the required capacity with minimum resources. This is also an embodiment of the database’s “autonomous” capability. Based on AI time series prediction, DAS automatically calculates and predicts the business model and capacity level of the database to implement on-demand automatic scaling.

The automatic elasticity of DAS implements a complete closed data loop that consists of performance collection, the decision-making center, the algorithm model, the specification recommendation module, control execution, and task tracking and evaluation. Performance collection gathers real-time performance data from instances, including multiple performance indicators, specification and configuration information, and the sessions running on the instance. The decision-making center makes an overall judgment from the current performance data and the instance session list, so that the response is based on root cause analysis. For example, SQL traffic limiting can be used to relieve a shortage of computing resources, whereas a genuine surge in service traffic lets the elastic scaling process continue. The algorithm model is the core of the DAS automatic elasticity service. It detects business load exceptions on database instances and calculates the recommended capacity specification, which settles when to scale, how to scale, and which computing specification to select. The specification recommendation module verifies the recommendation, generates specific suggestions adapted to the deployment type and actual running environment of the database instance, and double-checks the currently available specifications to ensure that the recommendation can be executed smoothly on the control side. Control execution distributes and executes the output specification recommendation. Task tracking measures and tracks the performance changes on the database instance before and after the specification change.

The following is a real case of automatic elasticity during Double 11:

As shown above, the user’s service traffic kept rising, and the CPU usage rate of the PolarDB instance continued to climb until the instance reached a high-load state. The auto scaling algorithm of DAS precisely identified the exception on the instance and automatically added two read-only nodes, which reduced the CPU usage rate to a low level. Two hours later, the user’s service traffic was still rising, which triggered DAS auto scaling for the second time. This time DAS upgraded the instance specification from 4-core 8 GB to 8-core 16 GB, and the instance stayed at that specification for over 10 hours, which helped the user get through the peak service hours.
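The scale-up decision in a case like this can be approximated with a short Python sketch. It is a simplified stand-in, not the DAS model: a least-squares linear trend replaces the AI time series prediction mentioned earlier, only vertical scaling is covered (adding read-only nodes is not), and the names `linear_forecast` and `recommend_cores`, the 60% target, and the list of specifications are assumptions.

```python
def linear_forecast(samples, steps_ahead):
    """Extrapolate a least-squares line fitted to equally spaced samples."""
    n = len(samples)
    if n < 2:
        return samples[-1]
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return mean_y + slope * (n - 1 + steps_ahead - mean_x)


def recommend_cores(cpu_percent, current_cores, specs, steps_ahead=6, target=60.0):
    """Pick the smallest core count that keeps predicted CPU usage under target.

    cpu_percent holds recent CPU usage samples (0-100) on the current
    specification; specs is the list of core counts available for purchase.
    """
    predicted = max(linear_forecast(cpu_percent, steps_ahead), 0.0)
    demand_cores = predicted / 100.0 * current_cores
    for cores in sorted(specs):
        if demand_cores / cores * 100.0 <= target:
            return cores
    return max(specs)   # nothing satisfies the target, so use the largest
```

For example, CPU samples rising from 40% to 65% in steps of 5 on a 4-core instance extrapolate to 95% six samples later, and `recommend_cores([40, 45, 50, 55, 60, 65], 4, [4, 8, 16])` returns 8. Predicting ahead rather than reacting to the current value is what lets scaling finish before the instance saturates.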

Intelligent Stress Testing

DAS provides an intelligent stress testing service to help users evaluate the database specification and capacity they need before migrating their business to the cloud. More details about this topic will be introduced in subsequent sections. Specification auto scaling lets users trigger scaling automatically based on a specified database performance threshold or the DAS built-in intelligent policy, which eases the burden of specification evaluation and management to some extent.

Most traditional stress testing solutions are based on existing tools and benchmarks, such as sysbench and TPC-C. The biggest problem is that the SQL they generate is far from the real business workload, so the results cannot accurately reflect real performance and stability. The intelligent stress testing provided by DAS is instead based on the user’s real workload, so its results directly reflect performance and stability changes under different workloads. To achieve this goal, intelligent stress testing needs to overcome the following challenges:

Long-term stress testing, such as 7x24 stability testing, must be possible even when not enough SQL can be collected. Considering the time and storage cost of SQL collection, DAS needs to generate SQL that matches the business from the collected sample.
Concurrent Playback Capability: DAS must reproduce the same concurrency as the real business and provide speed control (such as 2x and 10x playback) and peak stress testing functions.

By automatically learning business models, DAS generates realistic workloads that fit the required stress testing duration and provides users with richer stress testing scenarios. This helps users address the challenges of Double 11 preparation and database selection.
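The speed control mentioned for concurrent playback can be sketched in Python as a rescaling of captured timestamps. This is an illustrative assumption about one way to do it, not a description of DAS internals, and `replay_schedule` is a hypothetical name. Workload generation from a partial SQL sample, the first challenge above, is not covered.

```python
def replay_schedule(events, speed=1.0):
    """Turn captured (timestamp_seconds, sql) events into replay offsets.

    Offsets are measured from the first captured event and divided by
    `speed`, so speed=2.0 replays the same statements in half the time
    while preserving their order and relative spacing.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not events:
        return []
    ordered = sorted(events, key=lambda event: event[0])
    start = ordered[0][0]
    return [((ts - start) / speed, sql) for ts, sql in ordered]
```

Because every gap shrinks by the same factor, statements that overlapped in the original capture still overlap during playback, which is what keeps the concurrency pattern realistic at 2x or 10x speed.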

Automatic Parameter Tuning

Databases have hundreds of parameters, and user business scenarios vary widely, which makes it practically impossible for developers to tune the parameters to an optimal configuration by hand. Based on machine learning and intelligent stress testing, DAS automatically recommends an optimal parameter template for each database instance. Common cloud databases, such as MySQL and PolarDB, have hundreds of parameters, and each parameter can take anywhere from a dozen to half a million values. Parameter configuration therefore amounts to searching a huge multidimensional space for parameters that improve TPS or reduce latency. DBAs usually set configurations from experience or keep the default values. However, the best parameters differ across workloads and hardware, so even an experienced DBA cannot guarantee that a configuration is valid. Many small and medium-sized enterprises on the cloud also do not employ dedicated O&M personnel to set and adjust parameters. Database parameter settings therefore face the following challenges:

The space for parameter combination is huge, and the traversal is NP-hard.
Parameter tuning relies on experience, and the validity of parameters cannot be guaranteed.
Workloads on the cloud vary widely, and different workloads call for different parameter values.
Hardware Heterogeneity: Parameters vary with hardware specifications.

*Figure: Schematic Diagram of an Intelligent Parameter Tuning System*

*Figure: Intelligent Parameter Tuning Algorithms*

DAS treats parameter tuning as a black-box optimization problem and solves it with efficient machine learning and iterative learning. As a result, the TPS of different workloads increases by 15% to 55%, and the whole process completes in about 100 iteration steps over 3 to 5 hours. Thanks to the variety of workloads and hardware infrastructures at Alibaba Group, features and parameters for different workloads and hardware specifications can be learned offline. For a given target workload online, a suitable model can then be fitted through matching and meta-learning over partial order relations. With only a few iterations, the parameter values for the target workload are learned quickly, and ideal parameters are found in about 10 to 30 steps within one hour.
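To show what treating tuning as a black-box optimization problem means in practice, here is a minimal Python sketch. It deliberately swaps in a simple random local search instead of the machine learning and meta-learning methods the article refers to, and everything in it is hypothetical: the parameter names in `SPACE`, the `synthetic_tps` function that stands in for a real stress test, and the `tune` function itself.

```python
import random

# Hypothetical integer parameters with (low, high) bounds.
SPACE = {"buffer_pool_gb": (1, 64), "io_threads": (1, 64)}


def synthetic_tps(cfg):
    """Stand-in for a stress test run: returns a made-up TPS score."""
    return 1000 - (cfg["buffer_pool_gb"] - 48) ** 2 - 2 * (cfg["io_threads"] - 16) ** 2


def tune(benchmark, space, start, steps=100, seed=0):
    """Random local search: perturb the best config, keep improvements.

    The tuner only calls benchmark(cfg) and never looks inside it, which
    is what makes the problem "black-box".
    """
    rng = random.Random(seed)
    best_cfg, best_score = dict(start), benchmark(start)
    for _ in range(steps):
        candidate = {}
        for name, (low, high) in space.items():
            step = max(1, (high - low) // 8)
            moved = best_cfg[name] + rng.randint(-step, step)
            candidate[name] = min(high, max(low, moved))
        score = benchmark(candidate)
        if score > best_score:
            best_cfg, best_score = candidate, score
    return best_cfg, best_score
```

Each call to `benchmark` corresponds to one iteration step, which in a real setting is a full stress test run. That cost is why reducing the number of steps, as the offline learning described above aims to do, matters so much.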

Summary

Database Autonomy Service (DAS) enables enterprises to reduce database management costs by 90% and lower O&M risks by 80%. It also lets users focus on business innovation and sustain business growth. For Alibaba, after eleven years of Double 11 experience, this year was the starting point of a new cycle. DAS will keep supporting Double 11 and advancing its innovation-driven transformation.

Original Source:

DAS Was Polished and Enhanced to Provide Daily Support for Double 11

Alibaba Clouder, December 21, 2020

Written by Alibaba Cloud
