Luogic Show is a company that provides knowledge services such as talk shows and podcasts and advocates “time-saving acquisition of knowledge”. How does its technical team apply the idea of saving time in technical practice? This article describes how the Luogic Show technology team built its full link stress testing.
Full Link Stress Testing
The more visible a service is, the greater the pressure on its technical team. Any technical problem may be magnified, especially when the service targets users with high expectations for their knowledge acquisition experience.
Ensuring the availability and stability of services is the top priority for the technical team and one of its main technical challenges. For example, Luogic Show provides knowledge services to users around the world. They use fragmented time to learn in places such as high-speed trains, subways, and buses, they may open the App early in the morning or late at night, and some are located overseas. This requires the App to provide a stable, high-performance service and experience around the clock.
In the production environment, every user request travels a full link, from CDN to access layer, front-end applications, backend services, cache, storage, and middleware, and every part of that link faces uncertain traffic. Uncertain traffic here means the irregular and unknown traffic caused by a large promotion, a regular high-concurrency period, or other unplanned scenarios. Whether the system runs on a public cloud, a private cloud, a hybrid cloud, or a self-built IDC, identifying global bottlenecks and exploring and planning overall service capacity both require highly simulated full link stress testing.
The status of an app’s service depends not only on its own stability but also on traffic and other environmental factors, and any impact propagates upstream and downstream. Even if only a slight error occurs in one link, no one can be sure how large its effect will be after it accumulates across several upstream and downstream layers.
Therefore, establishing a mechanism in the production environment to verify that every production link can withstand all kinds of traffic has become a top priority for ensuring availability and stability. The best way to verify is to make the event happen in advance: drive realistic traffic into the production environment and simulate real service scenarios comprehensively, to confirm that the performance, capacity, and stability of each link are sound. This is the background of full link stress testing, which is an all-round upgrade of performance testing that makes system behavior “predictable”.
Implementation path of service stress testing
If full link stress testing is done well, then when real traffic arrives the system only has to go through scenarios that have already been verified repeatedly, like retaking an exam whose questions it has already practiced. Problems are then far less likely to occur.
Core Elements of Stress Testing
A clear implementation path is essential to a complete service stress test.
To measure the supporting capacity of a service accurately, service stress testing needs the same online environment, the same user scale, the same service scenarios, the same service volume, and the same traffic sources as production. The system can then take a “mock exam” in advance, which accurately measures the actual processing capacity of the service model. The core elements are: the stress testing environment, stress testing basic data, stress testing traffic (model and data), traffic initiation and control, and problem locating.
Stress testing environment and basic data
Basic data in the production environment is generally handled in one of two ways. In the first, the database layer is not modified, and tests run directly against test accounts in the base tables (the related data must also be complete). After the stress test, the transaction records generated by the tests are cleaned up (the cleanup can be fixed as SQL scripts or built into the system). In the second, stress testing traffic is labeled separately (for example, with a dedicated Header). The label is then identified and passed along throughout service processing, including asynchronous messages and middleware, and the data finally lands in a shadow table or shadow database. For more information about this method, see Alibaba full link stress testing practices. We chose this second method. In addition, stress testing in the production environment should be carried out during off-peak hours to avoid affecting production services.
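The labeling approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's actual implementation: the header name `X-Stress-Test`, the `shadow_` table prefix, and all function names are hypothetical.

```python
from typing import Dict, Mapping

# Hypothetical header name and shadow-table prefix; the article does not specify them.
STRESS_HEADER = "X-Stress-Test"
SHADOW_PREFIX = "shadow_"


def is_stress_traffic(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries the stress-test label."""
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get(STRESS_HEADER.lower()) == "1"


def resolve_table(table: str, headers: Mapping[str, str]) -> str:
    """Send labeled traffic to the shadow table and real traffic to the real table."""
    return SHADOW_PREFIX + table if is_stress_traffic(headers) else table


def downstream_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Headers to attach to outgoing calls or messages so the label is passed on."""
    return {STRESS_HEADER: "1"} if is_stress_traffic(headers) else {}
```

In this sketch every hop calls `downstream_headers` before making an outgoing call or publishing a message. The write only reaches the shadow table if no service on the path drops the label, which is why the label has to be passed through asynchronous messages and middleware as well.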
Build Process of the Full Link Stress Testing
Luogic Show currently has multiple traffic entry points, such as the Dedao APP, Dedao for Junior, and the Dedao Store in the WeChat Official Account “Dedao”.
Every year, Luogic Show holds a New Year’s speech. In the first year, more than 2 million people watched the speech online on Youku. In the second year, it was broadcast live on Shenzhen Satellite TV in cooperation with video websites such as Youku. When the QR code was shown during the live broadcast, a large number of user requests arrived at the same time. If the traffic during such a period is not accurately estimated and performance bottlenecks are not evaluated, a major service interruption may occur during the live broadcast.
Running full link stress testing on the whole system to ensure the service stability of every traffic entry point is therefore a top priority for the technical team, and we planned to introduce it in 2017. During this period, we also carried out a service-oriented rebuild of the system, which has been covered in the media before. This article focuses on the core of ensuring service stability: full link stress testing. The general build process was as follows:
- October 2017
Started the full link stress testing project. For the Store, completed the design and implementation of the rebuild plan, basic data preparation, and access for the homepage, details, shopping cart, and ordering. For the Dedao APP, completed access for all read interfaces related to the homepage and core functions, and performed the first full link stress test on its read interfaces.
- November 2017
The core services and activity pattern of the Store became relatively stable, and the Store entered a period of comprehensive stress testing and tackling key problems. At the same time, the rebuild for promotion, order, and payment began. The Dedao APP moved on to rebuilding write interfaces, and the form of the activity began to take shape. Read and write coverage included the homepage, listening to books, subscriptions, purchased items, promotion, and the knowledge ledger. The underlying rebuild on the user center side also began, including developing simulations of third-party services such as SMS.
- December 2017
The Store completed the final payment rebuild, followed by overall iterations of full link stress testing and optimization. The scope of full link stress testing for the Dedao APP kept growing, covering the entire homepage and the 5 tabs at the bottom. At the same time, iterations of stress testing on combinations of modules and of performance tuning began. The joint rebuild of the underlying payment and user center was completed, as was the rebuild that simulates all external calls for payment and SMS. Several complete rounds of full link stress testing were carried out, and problems were found and located to improve system stability. Risk identification, contingency plans, and duty shifts were sorted out as a focus.
After more than 3 months of concentrated work, we had connected 174 links to full link stress testing and created 44 scenarios. The stress testing consumed 120 million VUM and found more than 200 problems of various kinds.
If building the full link stress testing facilities in 2017 took Luogic Show from 0 to 1, then 2018 took them from 1 to N.
Since 2018, full link stress testing has been in a relatively mature stage. Based on PTS and previous experience, our test team quickly applied full link stress testing to daily activities and New Year’s activities, and to the newly launched service “Dedao for Junior”. Full link stress testing is now one of the core pieces of infrastructure for ensuring service stability.
Building full link stress testing is less a one-off project than an ongoing engineering effort. We could not have accomplished it with our own technical accumulation and staffing alone. We would like to thank Alibaba Cloud PTS and other technical teams for their support and help in implementing full link stress testing at Luogic Show. Below, we share the experience accumulated during implementation in the form of work notes.
A. Determining the traffic model:
When daily traffic is already large, the traffic model can be determined quickly from logs and monitoring. However, the daily peak may be modest while the traffic of an upcoming activity is very large. In that case, one method is to calculate the peak value of each interface within a time period from the expected service peak, and then combine these values into the traffic model for stress testing.
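The per-interface calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_traffic_model`, the interface paths, the calls-per-operation ratios, and the safety factor are all hypothetical, and real ratios would come from logs or product estimates.

```python
import math
from typing import Dict, Mapping


def build_traffic_model(
    peak_operations_per_second: float,
    calls_per_operation: Mapping[str, float],
    safety_factor: float = 1.0,
) -> Dict[str, int]:
    """Turn an estimated service peak into a target QPS for each interface.

    calls_per_operation maps an interface to the average number of calls
    that one business operation triggers on it (hypothetical ratios).
    """
    model = {}
    for interface, calls in calls_per_operation.items():
        target = peak_operations_per_second * calls * safety_factor
        # Round away float noise, then round up so the target is never understated.
        model[interface] = math.ceil(round(target, 6))
    return model


# Example: an activity expected to peak at 2,000 operations per second,
# with 50% headroom on top of the estimate.
example_model = build_traffic_model(
    2000,
    {"/home": 1, "/detail": 2.5, "/order": 0.25},
    safety_factor=1.5,
)
```

Here `example_model` gives the per-interface targets that together form the traffic model. The safety factor is one simple way to leave headroom when the estimate itself is uncertain.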
B. Dirty data:
Dirty data can be generated whether the production environment is modified to identify stress testing traffic or test accounts are used in the production environment. The best safeguards are:
- Verify and test more often in a simulation or performance environment:
Having a simulation environment is very important. With it, tracking, reproducing, and debugging many problems does not have to happen in the production environment, which reduces risk.
- Put multiple safeguard mechanisms in place:
For example, label stress testing traffic individually, make test UIDs clearly distinguishable from real ones, and back up key data in a timely manner.
C. Monitoring:
The purpose of full link stress testing is to identify and find problems comprehensively, so the monitoring coverage it requires is very high: from network access to the database, and from network layer 4 to layer 7 and the service itself. As stress testing goes deeper, you may find that the monitoring you have is never quite enough.
D. Expansion of stress testing:
Stress testing can be extended to other purposes. For example, we use stress testing to compare technology choices, in which case the traffic model and traffic level must be kept the same across runs. Full link stress testing can also verify the speed and stability of automatic scaling or pre-planned manual scaling. In its later stages, important exercises should also be carried out, such as checking traffic limiting capacity, observing the actual effects of various faults, and rehearsing contingency plans.
E. Network access:
If network access passes through many nodes, each node can first be stress tested separately to determine its capability and eliminate problems one by one. After all nodes are enabled, they can be stress tested together to determine whether the overall configuration and matching are correct.
For example, suppose CDN dynamic acceleration, WAF, Anti-DDoS Pro, SLB, and other products are used on the network path. If the overall stress test results are unsatisfactory, it is recommended to bypass some links during stress testing to narrow down the problem. For example, session persistence between WAF and SLB may cause load imbalance. The capacity and specifications of these products also need to be verified by stress testing. For example, the CPS, QPS, bandwidth, and connection count of SLB may become bottlenecks. By integrating the relevant SLB monitoring into PTS testing scenarios, you can view all the data in one place and use the stress test results to choose the lowest-cost usage mode.
In addition to the network access issues mentioned above, heavy load on the internal Nginx layer may also cause imbalance, especially under highly concurrent traffic. This again shows the value of highly simulated full link stress testing.
In particular, when stress testing for important activities, it is recommended to test the traffic pulses that the service will actually experience.
Alibaba Cloud PTS has this capability. It lets you observe system performance under a peak pulse while capacity is being increased step by step, for example to verify the traffic limiting capability, and to see whether the peak pulse is identified as a DDoS attack.
F. Parameter tuning:
Stress testing tends to reveal a large number of unreasonable parameter settings. Our tuning mainly involved kernel network parameters (such as fast recycling of connections), common Nginx parameters, and PHP-FPM parameters. Plenty of material on these topics is available online.
G. Cache and database:
- Check whether important services are cached;
- If Redis CPU usage is too high, check for fuzzy matching, unreasonable use of short connections, commands with high time complexity, and real-time or near-real-time persistence operations. You can also consider upgrading Redis to the cluster version and adding a Local Cache for hot data (the K-V data for an activity is very small, so Local Cache is a good fit);
- As stress testing progresses and problems surface, you may find that important databases have incomplete indexes.
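For the Local Cache idea in the list above, a minimal in-process sketch might look like the following. It assumes that hot keys are few and small, as described for activity data. The class `LocalTTLCache`, its `loader` callback (which in practice would wrap the real Redis read), and the TTL value are hypothetical and are not the team's implementation.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class LocalTTLCache:
    """Tiny in-process cache that sits in front of a slower lookup such as Redis."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader      # called on a miss, e.g. a function that reads Redis
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.loads = 0             # number of times the loader was actually called

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]        # still fresh: served locally, backend untouched
        value = self._loader(key)  # miss or expired: go to the backend once
        self.loads += 1
        self._entries[key] = (now + self._ttl, value)
        return value
```

A short TTL bounds how stale a hot value can get while removing most reads from Redis. The sketch has no eviction or locking, so it only suits a small, fixed set of hot keys.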
H. Mock service:
SMS and payment generally depend on third parties. Stress testing these paths requires special handling, such as setting up a separate Mock service and routing the stress testing traffic to it. Because the Mock service simulates a third-party service, it needs to be as realistic as possible. For example, the simulated delay should be close to that of the real third-party service. The Mock service itself is likely to become a bottleneck, so its capacity and the stability of its interface delay under high concurrency must be ensured. After all, the capacity, traffic limiting capability, and stability of real third-party payment and SMS interfaces are relatively good, and the Mock service should not be the weaker link.
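A Mock that imitates third-party latency can be sketched as below. This is a simplified in-process stand-in rather than a deployable service, and every name and number in it is an assumption: the class `MockSmsGateway`, the 50 to 150 ms delay range, and the response fields are hypothetical. The delay range should be replaced with measurements from the real provider.

```python
import random
import time
from typing import Callable, Dict, Optional


class MockSmsGateway:
    """Stand-in for a third-party SMS API that imitates its response delay."""

    def __init__(
        self,
        min_delay: float = 0.05,   # hypothetical lower bound of provider latency, seconds
        max_delay: float = 0.15,   # hypothetical upper bound of provider latency, seconds
        sleep: Callable[[float], None] = time.sleep,
        rng: Optional[random.Random] = None,
    ) -> None:
        self._min_delay = min_delay
        self._max_delay = max_delay
        self._sleep = sleep        # injectable so tests do not actually wait
        self._rng = rng or random.Random()
        self.sent_count = 0

    def send(self, phone: str, text: str) -> Dict[str, object]:
        # Draw a delay from the configured range and wait, as the real call would.
        delay = self._rng.uniform(self._min_delay, self._max_delay)
        self._sleep(delay)
        self.sent_count += 1
        # Canned success response; nothing is delivered to any real phone.
        return {"status": "ok", "phone": phone, "simulated_delay": delay}
```

Keeping the delay realistic matters because a Mock that answers instantly would release connections and threads faster than the real dependency does, which makes upstream services look healthier than they are.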
I. CPU threshold and service SLA of the system during stress testing
In our experience, the recommended CPU threshold is between 50% and 70%, mainly because the services run in containers. Because this is an Internet service, we also set 1 second as the upper limit for response time. During stress testing, we also try the product by hand at the same time to check the actual user experience.
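The thresholds above can be turned into a simple pass/fail check over the samples collected during a stress test run. The sketch below is hypothetical: the function name and the choice to judge by the peak sample (rather than an average or a percentile) are assumptions, and only the 70% and 1 second limits come from the text.

```python
from typing import List, Sequence


def check_thresholds(
    cpu_percent_samples: Sequence[float],
    response_time_samples: Sequence[float],
    cpu_limit_percent: float = 70.0,
    response_time_limit_seconds: float = 1.0,
) -> List[str]:
    """Return human-readable violations; an empty list means the run passed."""
    violations = []
    peak_cpu = max(cpu_percent_samples)
    slowest = max(response_time_samples)
    if peak_cpu > cpu_limit_percent:
        violations.append(
            f"CPU peaked at {peak_cpu:.1f}%, above {cpu_limit_percent:.1f}%"
        )
    if slowest > response_time_limit_seconds:
        violations.append(
            f"Slowest response took {slowest:.2f}s, "
            f"above {response_time_limit_seconds:.2f}s"
        )
    return violations
```

Judging by the peak is the strictest option. A team that tolerates occasional outliers might compare a high percentile against the limit instead.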
- Even if traffic limiting takes effect, it is very important to check whether the prompts and experience on the major client versions after limiting are as expected.
- Full link stress testing mainly covers the core interfaces. Other interfaces must also have some protection mechanism, such as a default traffic limit threshold, so that unexpected or underestimated traffic on non-core interfaces does not exhaust the capacity of the core system. (For the Java technology stack, see the open source Sentinel or AHAS, the free traffic limiting tool from Alibaba Cloud.)
- Core applications should be deployed separately at the physical machine layer.
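The default traffic limit threshold mentioned in the list above can be illustrated with a token bucket per interface. This is a generic sketch and not how Sentinel or AHAS work internally: the class names, the per-interface limits, and the in-memory bookkeeping are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(
        self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Add the tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class InterfaceLimiter:
    """Explicit limits for core interfaces, one default limit for everything else."""

    def __init__(
        self,
        core_limits: Dict[str, Tuple[float, float]],  # interface -> (rate, capacity)
        default_limit: Tuple[float, float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._default_limit = default_limit
        self._clock = clock
        self._buckets = {
            name: TokenBucket(rate, capacity, clock)
            for name, (rate, capacity) in core_limits.items()
        }

    def allow(self, interface: str) -> bool:
        bucket = self._buckets.get(interface)
        if bucket is None:
            # Interface not covered by stress testing: apply the default threshold.
            rate, capacity = self._default_limit
            bucket = self._buckets[interface] = TokenBucket(rate, capacity, self._clock)
        return bucket.allow()
```

The point of the default is that an interface nobody sized still gets a ceiling, so a surprise spike there is rejected early instead of consuming capacity that the core interfaces need.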
Time-saving technical concept
Full link stress testing has become one of the core technical facilities of Luogic Show and has greatly improved service stability. With Alibaba Cloud PTS, the automation of full link stress testing has improved further, which speeds up the build process and reduces the manpower required. We care a great deal about the efficiency and focus of the technical team. Not only in building the full link stress testing system but also in building many other service-level systems, we have relied on the technical strength of our partners to support rapid service development within a controllable time.
When the business moves fast, the team’s fundamental approach determines the path and manner in which its technology is built.