Unveiling the Secrets Behind Alibaba’s Full-scale Stress Testing for Double 11
By Zhang Chunmei (Niutu)
The importance and necessity of performance testing cannot be overlooked. Today, we will summarize performance testing from Alibaba’s perspective and explore how it applies to other enterprises in their technology and business strategy planning.
The purpose of performance testing is to remove system performance uncertainties caused by peak traffic in large marketing campaigns. An ideal marketing campaign cycle has the following closed-loop process:
Another step should be added between steps 1 and 2, which is environment reconstruction and basic data preparation, to emphasize that stress tests must be performed in the production environment.
- Preparation of the stress testing environment: A real-world online environment must be reused so that the stress testing results and the problems exposed reflect the most realistic situations. Stress testing data can be identified and passed through across the entire link.
- Basic data preparation: Taking an e-commerce scenario as an example, construct core basic data, such as buyer, seller, and commodity information, that meets the needs of the promotion scenario. Online data is used as the data source, kept at the same magnitude, and then sampled, filtered, and masked.
Performance testing evaluates capacity, identifies bottlenecks, and solves problems through a real and efficient stress testing method to ensure the stability of activities. Each step is critical; we’ll be going through each component in this article.
Process and Management
It has been seven years since Alibaba launched full-link stress testing in 2013. Over those years, we have continuously accumulated experience, summarized it, and optimized the practice, evolving from a large-scale activity that involved more than 200 people and all-night stress testing to a more intelligent approach that runs during working hours and requires only a few people. For a project of this size, effective process control and the division of duties are indispensable.
In its years of running the Double 11 safeguard (that is, the full-scale stress testing project), Alibaba has accumulated rich experience and developed a management mode of rigorous process control and division of duties, which is summarized as follows:
Note: The time points in the figure are simulated as a sequence reference only.
Effective process planning and management can greatly improve team collaboration efficiency. Combined with the intelligent capabilities provided by the tool platform, stress tests that used to run overnight and involve 200 people can now be streamlined to a single-day activity requiring fewer than 10 people. The equation in this case is: effective solution + adequate preparation + reliable platform technical products = a successful stress test.
The following sections describe how Alibaba implements stress testing for the annual Double 11, specifically data preparation, architecture reconstruction, traffic security policies (environment and traffic isolation), stress testing implementation, and issue identification and analysis.
Reconstruction of the Stress Testing Environment
While preparing data, we should consider the type of the stress testing environment, that is, the deployment environment of the stress testing object. Different environments require different preparations.
At Alibaba, all stress tests, including the stress test for Double 11, are performed in the online environment. For a full-link stress test, it is therefore necessary to assess whether the existing environment can be used directly, whether an API will be intercepted after being called in multiple stress tests, whether dirty data will be produced, and how to avoid that impact. These problems can be classified into two types: business and data transmission. The problems are fairly clear-cut, so we reconstruct the environment according to these two types.
The reconstructions include business reconstruction and middleware reconstruction, which have been completed in the era of internal full-link stress testing 1.0. For external customers, it can be used as a reference for technical transformation. At the same time, we can offer one-stop capabilities through mature products and solutions, saving you complex transformation and maintenance costs.
The purpose of business reconstruction is to handle business exceptions during stress testing or solve the problem where stress testing requests cannot be executed normally. For example:
- Traffic differentiation and identification: distinguishes stress testing traffic from business traffic, and allows both of them to be identified in the full-link system.
- Uniqueness constraints on traffic: For example, if a user places an order and then repeats the same order, the second request fails.
- Traffic limiting and interception: If traffic limiting is required, inbound traffic degradation must be configurable so that the settings can be adjusted in real time.
- Remove the impact of stress testing data on statistics.
- Dynamic verification.
It is impossible to list all the content involved in business transformation, which instead needs to be sorted out according to different business models, business architectures, and configurations. After general streamlining and transformation, all subsequent new applications are developed in accordance with the specifications, and only basic troubleshooting is required before the annual stress testing.
As the component that connects business applications, middleware performs a critical function in stress testing: it passes the traffic identifier along the entire link, all the way down to the database level. Starting in 2013, we upgraded and transformed the middleware used by core applications and tackled many problems during that process, for example, the comprehensiveness of the transformation, the cost of modifying business code, and the version compatibility issues caused by the transformation.
The following diagram shows the model of stress testing traffic after transformation.
Environment reconstruction requires analysis and design based on specific business scenarios. The cloud-based high-availability solution provides the full-link stress testing service.
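To make the idea of passing the traffic identifier through the link more concrete, here is a minimal sketch in Python. It is not Alibaba's middleware: the header name `X-Stress-Test`, the value `"1"`, and the helper functions are all assumptions made for illustration. The entry point reads the flag once, stores it in a request-scoped context, and every outbound call copies it onto the downstream request.

```python
import contextvars
from typing import Callable, Optional

# Hypothetical header name; the article does not say how the flag is encoded.
STRESS_HEADER = "X-Stress-Test"

# Request-scoped flag, so concurrent requests do not see each other's value.
_is_stress = contextvars.ContextVar("is_stress", default=False)


def is_stress_traffic() -> bool:
    """True while the current request is marked as stress testing traffic."""
    return _is_stress.get()


def handle_inbound(headers: dict, handler: Callable[[], object]) -> object:
    """Entry-point hook: read the flag once, then run the business handler."""
    token = _is_stress.set(headers.get(STRESS_HEADER) == "1")
    try:
        return handler()
    finally:
        _is_stress.reset(token)  # do not leak the flag into the next request


def outbound_headers(base: Optional[dict] = None) -> dict:
    """Client-side hook: copy the flag onto every downstream call."""
    headers = dict(base or {})
    if is_stress_traffic():
        headers[STRESS_HEADER] = "1"
    return headers
```

Business code never touches the flag directly in this sketch; only the inbound and outbound hooks do, which is one way to keep business code modification costs low.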
Once the promotion schedule is confirmed, we will review the business model to determine the technical architecture applications corresponding to the business model, business scope to be included, data magnitude, and forms of data. Therefore, we prepare two kinds of data: business model data and stress testing traffic data.
Data preparation includes creating the business model and constructing basic data.
Business Model Data
Business model data is the data related to the business model under the stress testing, including the involved APIs and the stress testing magnitude or ratio among the APIs. The accuracy of business model construction directly determines the reliability of stress testing results.
The purpose of model design is to abstract an executable stress testing model from collected business operations and predict and design the elements in each sub-model to generate the final executable stress testing model. Before Double 11, we will confirm related businesses and classify the scenarios.
- Existing business scenarios: Historical data is collected and processed to generate prediction data that is used to form a prototype model. Combined with new business features, the prototype model is then used to construct a model for existing businesses.
- New business scenarios: A new business model is directly constructed pro rata based on new businesses.
Both types of business scenarios are combined to form the final business model. The following diagram shows an example.
When assembling business model data, pay attention to some key factors. For example, the key factors for a specific e-commerce business model include:
- 1 to N: determines whether multiple calls occur when one upstream business request corresponds to several downstream business interfaces.
- Business proportion: calculates the proportion of different types of businesses based on historical data.
After the business model is assembled, the business model in a single transaction should be a funnel. The ratio between each layer varies according to different levels, business features, and rules. Theoretically, the ratio does not change in a promotional campaign. The following diagram shows the funnel model.
The business model corresponds to the stress testing magnitude levels in the stress test. The requests per second (RPS) mode is used in stress testing for all Taobao promotion campaigns. The RPS mode, which structures APIs in a funnel starting from the business end, is well suited to capacity planning. The RPS mode is also well supported in Performance Testing Service (PTS), which has been launched officially.
Basic Data of Stress Testing
While the business model is mapped to the interfaces or APIs in stress testing, stress testing traffic data is used to determine the content tested by these APIs, for example, logged-in users, products and stores browsed, products purchased, and even prices at payment.
Part of the traffic data is the specific RPS value corresponding to the preceding business model. The model reflects the proportional relationship, while the traffic data contains the specific RPS value for each stress test.
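The relationship between the funnel model (proportions) and the traffic data (concrete RPS values) can be illustrated with a small calculation. The API names and ratios below are invented for the example and are not Alibaba's real figures; each ratio is assumed to be the share of the previous layer's requests that reaches the next layer.

```python
# Hypothetical funnel: each ratio is the share of the previous layer's
# requests that reaches this layer. All names and numbers are made up.
FUNNEL = [
    ("browse_item", 1.0),
    ("add_to_cart", 0.30),
    ("create_order", 0.50),
    ("pay", 0.80),
]


def rps_targets(entry_rps: float, funnel=FUNNEL) -> dict:
    """Turn funnel ratios plus a top-of-funnel RPS into per-API RPS targets."""
    targets = {}
    current = entry_rps
    for api, ratio in funnel:
        current *= ratio  # narrow the funnel layer by layer
        targets[api] = round(current, 2)
    return targets
```

With an assumed entry rate of 10,000 RPS, `rps_targets(10000)` yields 10,000 for `browse_item`, 3,000 for `add_to_cart`, 1,500 for `create_order`, and 1,200 for `pay`. The ratios stay fixed across stress tests, and only the entry RPS changes from one test to the next.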
The most important part of the traffic data is the real stress testing data, which we call basic data, such as the buyer, seller, and commodity data of transactions. The purpose of full-link stress testing is to simulate scenarios in Double 11. Therefore, the authenticity of the simulation is very important, and the authenticity of the basic data is therefore crucial. Full-link stress testing uses online data as the data source. The online data goes through sampling, filtering, and masking to form data that can be used for stress testing.
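The sampling, filtering, and masking steps might look roughly like the following sketch. The field names (`status`, `phone`), the filtering rule, and the salted-hash masking are assumptions chosen for illustration; a real pipeline would apply whatever masking rules its data governance requires.

```python
import hashlib
import random


def mask(value: str, salt: str = "demo-salt") -> str:
    """Replace a sensitive value with a stable token (illustrative only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def prepare_basic_data(rows, sample_rate: float, seed: int = 0) -> list:
    """Sample online rows, filter unusable ones, then mask sensitive fields."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    prepared = []
    for row in rows:
        if rng.random() >= sample_rate:       # sampling
            continue
        if row.get("status") != "active":     # filtering (hypothetical rule)
            continue
        masked = dict(row)                    # never modify the source row
        masked["phone"] = mask(row["phone"])  # masking
        prepared.append(masked)
    return prepared
```

Because the token is derived deterministically from the original value, the same buyer maps to the same masked value across tables, which keeps relationships between buyer, seller, and commodity records intact.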
When online data is used, particularly when data writes are involved, we use shadow tables to prevent dirty data from being produced. When a request is identified as stress testing traffic, it reads and writes the shadow table; otherwise, it reads and writes the official online table. The shadow table thus keeps stress testing traffic away from online data.
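The routing rule for shadow tables can be sketched with Python's built-in `sqlite3` module. The `shadow_` prefix, the `orders` table, and the helper functions are hypothetical; in practice this decision sits inside the data access middleware rather than in business code.

```python
import sqlite3

SHADOW_PREFIX = "shadow_"  # hypothetical naming convention


def table_for(name: str, is_stress: bool) -> str:
    """Pick the shadow table for stress traffic, the online table otherwise."""
    return SHADOW_PREFIX + name if is_stress else name


def insert_order(conn, order_id: int, buyer: str, is_stress: bool) -> None:
    table = table_for("orders", is_stress)
    # The table name comes from a fixed internal mapping, never from user input.
    conn.execute(f"INSERT INTO {table} (id, buyer) VALUES (?, ?)", (order_id, buyer))


def count_orders(conn, is_stress: bool) -> int:
    table = table_for("orders", is_stress)
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def make_demo_db():
    """In-memory database with an online table and its shadow twin."""
    conn = sqlite3.connect(":memory:")
    for table in ("orders", "shadow_orders"):  # same structure, separate storage
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, buyer TEXT)")
    return conn
```

In this sketch, a stress testing order and a real order can even share the same ID without conflict, because they never land in the same table.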
Within Taobao’s internal stress testing system, the data platform and the stress testing platform are separated. The data platform manages and provides stress testing data, including model data and traffic data, while the stress testing platform provides the stress-generating capability, which ensures that stress testing requests are sent from all over the country using the specified protocol and at the specified rate. With the DataWorks function provided by PTS, we can combine the internal data platform with the stress testing platform to form a unified stress testing system, in which users can easily configure what they need after they construct the stress testing data and define parameters in files or any custom form.
Traffic Security Policy
The traffic security policy is mainly designed to ensure that stress testing traffic is applied without causing data errors, and that the stress test is executed securely and as expected. This involves two layers of considerations:
One is the rigorous isolation of test data from normal data, which is the monitoring and protection mechanism for illegal traffic.
- Method: shadow table data. The shadow table is a writable stress testing data table that has the same structure as the online table but is in an isolated location.
- Result: Data is isolated to avoid data errors.
The other is the secure filtering of stress testing traffic, which prevents stress testing data from being identified as attack traffic.
- Method: connect relevant security policies to the throttling and degradation function, slightly lift security policies for stress testing, or identify stress testing traffic by using a special identifier.
- Result: Stress testing traffic is not identified as attack traffic. The stress test runs successfully, while the security of online businesses is guaranteed.
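One possible way to let security filtering recognize stress testing traffic by a special identifier, without opening a hole for real attackers, is to sign the identifier. The sketch below is an assumption for illustration, not the mechanism Alibaba uses: the header names, the shared secret, and the simple RPS-based attack rule are all hypothetical.

```python
import hashlib
import hmac

# All names below are hypothetical; the article does not describe the identifier format.
RUN_HEADER = "X-Stress-Run"
SIG_HEADER = "X-Stress-Signature"


def sign(run_id: str, secret: bytes) -> str:
    """HMAC-SHA256 signature of the stress test run ID, as a hex string."""
    return hmac.new(secret, run_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_verified_stress_request(headers: dict, secret: bytes) -> bool:
    run_id = headers.get(RUN_HEADER, "")
    signature = headers.get(SIG_HEADER, "")
    if not run_id or not signature:
        return False
    expected = sign(run_id, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


def should_block(headers: dict, observed_rps: float, attack_rps: float, secret: bytes) -> bool:
    """Block traffic that looks like an attack unless it is verified stress traffic."""
    if is_verified_stress_request(headers, secret):
        return False
    return observed_rps > attack_rps
```

A request that carries a forged or missing signature is treated like any other traffic, so online businesses stay protected while the stress test runs.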
Third-party systems such as Alipay and SMS must be connected to the stress testing system due to service specificity. In 2013, we ran the first full-link stress test, but failed to connect to its downstream business chain. In 2014, we built an end-to-end stress testing system that included Alipay and logistics service before the Double 11 stress test. For external customers, both Alipay and SMS provide mock services to perform full-link stress testing.
Stress Testing Implementation
Following the process control described at the beginning, the full-link stress test starts once everything is ready. In addition to stress testing in the general sense, there are two preliminary operations: system ramp-up and logon preparation.
Note: This article does not describe the single-link stress testing debugging after the first transformation, because it can be run and verified by developers themselves.
The system ramp-up described here does not include the prerun that happens internally in Alibaba. The purpose of ramp-up is to cache necessary data in advance, so that the system is ready for the promotion campaign and can use the cache to the greatest extent possible during the promotion.
External customers can run a lightweight stress test ahead of a real-world promotion campaign to ramp up the system and cache necessary data in advance.
Logon preparation: Logon preparation is mainly used in scenarios such as flash sales and where persistent connection is required. In these cases, users log on to the system step by step and then perform business actions. In the event of a large-scale campaign, perform logon preparations in advance to simulate real user logon scenarios and protect the logon system.
Formal stress testing: Generally, multiple stress testing policies are implemented based on the stress testing plan. Stress tests for Taobao’s Double 11 generally include the following steps:
1) Peak pulse: The exact target peak traffic at 00:00 of the promotion day is simulated to conduct a promotion-state stress test and observe the system performance.
2) Vertical jump test: The rate limiting and degradation function is disabled and the current stress testing value is raised to observe the limits of the system. Note that a vertical jump test can be carried out only after the target stress testing value has been reached. You can perform multiple rounds of stress testing until the system encounters an exception.
3) Rate limiting and degradation verification: Whether the rate limiting and degradation protection function works is checked. The Application High Availability Service (AHAS) with comprehensive rate limiting and degradation capabilities is introduced to provide full-link degradation protection.
4) Destructive test: This test is designed to verify the effectiveness of the plan. It is similar to a plan implementation drill in a disaster recovery exercise. In this test, the promotion-state stress testing is kept running while the plan is implemented, in order to verify its effectiveness and observe its impact on the system.
External customers can configure data at different magnitudes to run several stress tests and observe the system performance. Stress testing should be a repeated operation that stands multiple rounds of verification instead of a one-time operation.
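As a rough illustration of the first two policies above, the following sketch builds a per-second load profile for a peak pulse and searches for the system limit with a vertical jump. The function names, the step size, and the `is_healthy` callback (which stands in for running one round of stress testing and checking the monitoring data) are assumptions for the example.

```python
from typing import Callable, Optional


def peak_pulse(target_rps: int, hold_seconds: int) -> list:
    """Per-second load profile that goes straight to the target peak and holds it."""
    return [target_rps] * hold_seconds


def find_limit(
    target_rps: int,
    step_rps: int,
    is_healthy: Callable[[int], bool],
    max_rounds: int = 20,
) -> Optional[int]:
    """Vertical jump: raise the load round by round until a health check fails.

    Returns the last RPS level that passed, or None if the target itself failed.
    """
    last_ok = None
    rps = target_rps
    for _ in range(max_rounds):
        if not is_healthy(rps):  # the system hit an exception at this level
            break
        last_ok = rps
        rps += step_rps
    return last_ok
```

For example, with a target of 1,000 RPS, a step of 200 RPS, and a system that stays healthy up to 1,500 RPS, `find_limit` reports 1,400 as the highest level that passed. The search only starts from the target value, mirroring the rule that a vertical jump test is carried out only after the target stress testing value has been reached.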
Problem Identification and Analysis
After the stress test is completed, it is reviewed based on the system performance and monitoring data collected during the test to identify and analyze current system bottlenecks, determine the improvement plan, and schedule the next stress test. Issues are located and analyzed case by case because they involve many systems and different forms of sub-services, and in many cases front-line developers must be involved.
Performance Testing Service (PTS) generates a stress testing report that contains detailed statistics, trend charts, sampling logs, and monitoring data that has been added. In the future, PTS will additionally provide the architecture monitoring function to help performance testing engineers better identify the functioning of the system and general bottlenecks in stress tests from the perspective of system architecture.
Intelligent Stress Testing
Having started from scratch, Alibaba’s full-link stress testing has now entered its 7th year and evolved into an intelligent tool. Some of its functions will also be integrated into products that are officially launched. Please stay tuned!
- More supported protocols
- Capacity evaluation
- Automatic problem detection
- Full-link function testing and stress testing rehearsal
- Stress testing normalization
- Elastic promotion with scaling and stress testing in parallel
Alibaba has been running full-link stress testing for seven consecutive years, and has made progress through years of refinement and accumulation of experience. With the emergence of new technologies, we will continue to improve it in the pursuit of excellence. At the same time, we hope to use our experience to empower and help external customers sail through every promotion campaign, and to make full-link stress testing available for more everyday scenarios.