Alibaba EagleEye: Ensuring Business Continuity through Link Monitoring
The annual Double 11 Shopping Festival (Singles’ Day) has always been a great battle for Alibaba Group. From a technical perspective, winning this battle requires coordination across individual systems, synchronization of functions in each application, and cooperation from technical staff. When it comes to testing and deployment, these systems and applications must be considered as a whole, instead of several separate components.
Alibaba Group EagleEye Monitoring System
As Alibaba Group’s link tracking system, EagleEye monitors link status across the whole group, although EagleEye’s own services are not part of the transaction link itself. It covers the majority of Alibaba Group’s scenarios that involve remote calls to middleware, and plays a key role in troubleshooting. EagleEye ensures the stability of individual systems and provides significant support to the whole technical team towards winning this “battle”.
Figure 1 — EagleEye system overview
Over the past two years, Alibaba Group has seen rapid growth in its business size. Vertically, transaction volume set a new record, with a new transaction peak at midnight during the 2017 Double 11 Shopping Festival; horizontally, Alibaba Group’s industries and business fields are expanding, and Alibaba Group has established cooperation with companies such as AutoNavi, Youku, Umeng, and Damai to seek co-development.
As data size keeps growing, EagleEye faced a great challenge during this year’s Double 11 Shopping Festival: handling the data growth driven by rapid business development while keeping its own services stable.
Figure 2 — EagleEye’s support for business
End-to-end stress testing has always been an effective way for Alibaba Group to prepare for Double 11 promo events: it tests the load capacity of individual systems by simulating Double 11 traffic in the online environment. EagleEye plays an important role in end-to-end stress testing. It passes stress testing labels along the call link to distinguish stress testing traffic from real traffic, and it collects and presents stress testing data to help business-side developers troubleshoot individual systems. Supporting end-to-end stress testing is therefore also an important mission of EagleEye.
EagleEye Performance Improvements
Whether during normalized stress testing, end-to-end stress testing, or the Double 11 Shopping Festival itself, the major problems EagleEye faces are how to keep its own system stable under large volumes of data, how to present the status of individual systems more quickly, and how to help developers identify and locate problems. This year, a series of changes and upgrades improved EagleEye’s performance and made troubleshooting significantly more efficient for business-side developers.
Figure 3 — System architecture
Computing Capability Sinking
In the early stages, EagleEye’s link tracking and statistics were implemented based on detail logs, with complete detail logs collected and then aggregated in stream computing. As business size grew, log data volume increased sharply, and computing volume also grew linearly, which led to high resource consumption. In addition, the number of logs reaches a peak during end-to-end stress testing or big promo events, which often results in overloading computing cluster systems, data latency, or even data loss.
To solve these problems, sampling was originally used. By reducing the number of collected logs, we can stabilize the load on computing clusters, ensure the stability of EagleEye’s own services, and minimize the influence of business peaks. However, this also caused an obvious problem. Because actual data is estimated from the sampling rate when statistics are calculated, aggregated data is not accurate in scenarios where the volume of collected data is small and sampling is aggressive (only a small fraction of logs is kept). In these scenarios, the real business status cannot be presented and the collected data loses its value.
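The accuracy problem with sampling can be illustrated with a small simulation. This is a minimal sketch, not EagleEye’s implementation: the function names, the 1% sampling rate, and the call volumes are all assumptions chosen for illustration. Each log is kept with a fixed probability, and the total is estimated by scaling the kept count back up.

```python
import random

def estimate_total(true_count, sample_rate, rng):
    """Keep each log with probability sample_rate, then scale the kept count back up."""
    kept = sum(1 for _ in range(true_count) if rng.random() < sample_rate)
    return kept / sample_rate

def mean_relative_error(true_count, sample_rate, trials=100, seed=42):
    """Average |estimate - truth| / truth over repeated sampling runs."""
    rng = random.Random(seed)
    errors = [
        abs(estimate_total(true_count, sample_rate, rng) - true_count) / true_count
        for _ in range(trials)
    ]
    return sum(errors) / len(errors)

# Hypothetical volumes: a rarely called service vs. a busy one, both sampled at 1%.
low_volume_error = mean_relative_error(true_count=50, sample_rate=0.01)
high_volume_error = mean_relative_error(true_count=20_000, sample_rate=0.01)

print(f"low volume:  {low_volume_error:.2f}")
print(f"high volume: {high_volume_error:.2f}")
```

With only 50 real calls, a 1% sample keeps zero or one log in most runs, so the scaled-up estimate is either 0 or at least 100 and never close to 50. The busy service, by contrast, is estimated with a much smaller relative error.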
To fully shield EagleEye’s computing clusters from business peaks, some real-time computation is moved onto machines on the business side, which decouples log volume from traffic volume and keeps the computing clusters stable. The implementation works as follows: data is first aggregated by a specified dimension (usually time) on machines on the business side; computing clusters then collect these statistics and aggregate them again. This significantly stabilizes the load on computing clusters.
Figure 4 — Computing capability sinking
Computing capability sinking can also be perceived as the distribution of computing, which consumes a very small part of resources on the business side and ensures the stability of EagleEye’s clusters. In addition, the computing volume of clusters will not grow as the business volume grows. Instead, its growth only depends on application scale (the number of applications and machines) and statistics dimensions. This avoids the problem with high loads caused by instant business peak values, and allows EagleEye to maintain stability and produce accurate data during end-to-end stress testing and big promo events.
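The two-stage aggregation described above can be sketched roughly as follows. The event layout `(timestamp, service, rt_ms, ok)`, the one-second window, and the function names `pre_aggregate` and `merge` are assumptions for illustration only; the article does not describe EagleEye’s actual record format.

```python
from collections import defaultdict

def _empty_stats():
    return {"calls": 0, "errors": 0, "rt_ms_sum": 0}

def pre_aggregate(events, window_seconds=1):
    """Stage 1, on a business machine: collapse raw call events into per-window counters."""
    buckets = defaultdict(_empty_stats)
    for ts, service, rt_ms, ok in events:
        window_start = int(ts // window_seconds) * window_seconds
        stats = buckets[(window_start, service)]
        stats["calls"] += 1
        stats["errors"] += 0 if ok else 1
        stats["rt_ms_sum"] += rt_ms
    return dict(buckets)

def merge(partials):
    """Stage 2, on the computing cluster: sum the partial counters from all machines."""
    total = defaultdict(_empty_stats)
    for partial in partials:
        for key, stats in partial.items():
            for name, value in stats.items():
                total[key][name] += value
    return dict(total)

# Hypothetical raw events on two business machines.
machine_a = [(100.1, "order", 12, True), (100.7, "order", 30, False), (101.2, "order", 8, True)]
machine_b = [(100.4, "order", 20, True)]

merged = merge([pre_aggregate(machine_a), pre_aggregate(machine_b)])
print(merged)
```

However many calls a machine handles within one window, it sends one record per (window, service) pair. The cluster’s input therefore scales with the number of machines and dimensions rather than with traffic, and the counts stay exact because nothing is sampled.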
EagleEye has always focused on calls to the middleware layer, but Alibaba has a large business volume and complicated systems. Because each component has clear and specific functions, some data in the middleware layer is difficult to associate with business data. As a result, it is a challenge to implement link tracking, troubleshooting, and capacity planning for specific business scenarios.
This year, EagleEye introduced scenario-specific links together with business scenario labeling. Similar to labeling stress testing traffic, business scenario labeling tags a specific business flow with a corresponding business scenario label and associates all calls to middleware (including services, cache, databases, and messages) under that label. This helps business-side developers better distinguish the business semantics of specific RPC traffic. Since RPC traffic under a specific business scenario can be clearly identified, it is also very useful for analyzing critical metrics such as cache hit rate and database response time (RT).
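The labeling idea can be sketched in a single process with Python’s `contextvars`. This is only an illustration under stated assumptions: in a real distributed system the label would have to be carried across processes in the RPC context, and the names `business_scenario`, `record_middleware_call`, and the example scenario label are hypothetical, not EagleEye APIs.

```python
import contextvars
from contextlib import contextmanager

# Holds the current business scenario label for the running request (None if unlabeled).
_scenario = contextvars.ContextVar("scenario", default=None)
recorded_calls = []

@contextmanager
def business_scenario(label):
    """Attach a scenario label to every middleware call made inside the block."""
    token = _scenario.set(label)
    try:
        yield
    finally:
        _scenario.reset(token)

def record_middleware_call(kind, target, rt_ms):
    """Stand-in for a tracing hook: tag the call with whatever label is active."""
    recorded_calls.append(
        {"scenario": _scenario.get(), "kind": kind, "target": target, "rt_ms": rt_ms}
    )

def place_order():
    # Business code stays unaware of labels; it just calls middleware.
    record_middleware_call("cache", "item-cache", 1)
    record_middleware_call("db", "orders", 9)

with business_scenario("flash-sale-checkout"):
    place_order()   # both calls are tagged "flash-sale-checkout"
place_order()       # same code path, no label

for call in recorded_calls:
    print(call)
```

Because the label travels with the request context rather than with the business code, the same cache or database call can be attributed to different scenarios depending on which entry point triggered it.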
Figure 5 — Traffic scenario labels
Based on this data, we can analyze end-to-end stress testing data more effectively. Labels are added to critical business flows before stress testing (or during normal operation). After stress testing, a performance baseline is obtained for each business scenario from the traffic under its label. This enables more efficient troubleshooting and performance tuning on core links, and makes stress testing more efficient and more meaningful.
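As a rough sketch of what a per-scenario baseline could look like, the snippet below groups labeled calls by scenario and compares a later run against the baseline. The record layout, the chosen statistics, and the 20% tolerance are assumptions for illustration; the article does not specify how EagleEye computes or uses its baselines.

```python
from collections import defaultdict
from statistics import mean

def build_baseline(calls):
    """Group (scenario, rt_ms) records by scenario and summarize each group."""
    by_scenario = defaultdict(list)
    for scenario, rt_ms in calls:
        by_scenario[scenario].append(rt_ms)
    return {
        scenario: {"calls": len(rts), "avg_rt_ms": mean(rts), "max_rt_ms": max(rts)}
        for scenario, rts in by_scenario.items()
    }

def regressions(baseline, current, tolerance=1.2):
    """Scenarios whose average RT exceeds the baseline by more than the tolerance factor."""
    return sorted(
        scenario
        for scenario, stats in current.items()
        if scenario in baseline
        and stats["avg_rt_ms"] > baseline[scenario]["avg_rt_ms"] * tolerance
    )

# Hypothetical labeled traffic from two stress testing runs.
baseline = build_baseline([("checkout", 10), ("checkout", 14), ("search", 5)])
current = build_baseline([("checkout", 20), ("checkout", 22), ("search", 5)])

slow_scenarios = regressions(baseline, current)
print(slow_scenarios)
```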
EagleEye’s link data plays a crucial role in determining and locating problems. Richer data forms and presentations significantly improve troubleshooting efficiency.
While preparing for the Double 11 promo event, we encountered and solved many difficult problems, a large proportion of which were standalone (single-machine) problems. In distributed systems, standalone problems are very common. They usually aren’t directly related to business code, but are related to containers or machines to some extent. In addition, these problems occur randomly and with very low probability, so finding them is usually a challenge. In actual business, they may show up as RT jitter or other low-probability events.
Although EagleEye’s call link can quickly locate these problems, the call link only reflects the perspective of a single request. After locating an IP, more data may need to be analyzed before appropriate decisions can be made. To address this, EagleEye provides functions such as error TopN distribution and a system heat map, allowing business-side developers to quickly locate problems. Standalone failures usually have little influence on overall metrics and are hard to locate through application-level monitoring metrics. EagleEye counts errors on individual machines, ranks the machines by error count, and reports the top 10. When a standalone failure occurs, EagleEye can point to a specific IP, and developers can make appropriate decisions based on the number of errors under that IP. This reduces the time developers spend troubleshooting. System heat maps clearly show the health of systems during stress testing and big promo events. On one hand, we can clearly see whether any machines are outliers. On the other hand, we can verify whether traffic goes to the right place.
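The error TopN idea amounts to counting errors per machine and ranking the result. A minimal sketch, assuming error logs arrive as `(ip, error_type)` pairs; this layout, the function name, and the sample data are hypothetical:

```python
from collections import Counter

def top_error_machines(error_logs, n=10):
    """Count errors per machine IP and return the n machines with the most errors."""
    counts = Counter(ip for ip, _error_type in error_logs)
    return counts.most_common(n)

# Hypothetical error logs: one misbehaving machine among mostly healthy ones.
logs = (
    [("10.0.0.7", "timeout")] * 40
    + [("10.0.0.1", "timeout")] * 2
    + [("10.0.0.2", "npe")] * 1
)

top = top_error_machines(logs, n=2)
print(top)
```

In an application-level error rate, 40 errors on one machine may be lost among millions of successful calls, but in the per-machine ranking the faulty IP stands out immediately.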
Figure 6 — System heat map
As one of Alibaba Group’s efficient troubleshooting tools, EagleEye helps business-side developers quickly locate and solve problems, thereby reducing failure duration and improving maintenance efficiency. EagleEye also holds a large volume of data in its bottom layer. Over the past year, we have been mining this data to make the most of it. Meanwhile, we want to establish an ecosystem based on this data to help users grow their business. During this process, we created many useful products, laying a solid foundation for Alibaba Group’s technical development.
Tiancheng Project: Tiancheng is a system stability solution that is based on scenario data of EagleEye and its monitoring metrics (such as middleware and system metrics) and combines many other monitoring products. It is designed to quickly discover and accurately locate problems, normalize big promo events and stress testing, and perform other tasks.
Vanguard Plan: A lightweight end-to-end stress testing tool. Vanguard Plan implements normalized end-to-end stress testing and troubleshooting based on EagleEye’s middleware, system metrics, and stress testing data. It is one of the efficient tools used to support Double 11 promo events and ensure smooth stress testing. This year’s environment is more complicated than last year’s. However, compared with the eight stress tests needed last year, only three were needed to achieve this year’s goal, saving more than 1,000 employees from tedious labor and significantly improving delivery quality and big promo efficiency.
Accurate Regression: EagleEye’s call link collection and computation allow it to make accurate recommendations for specific test cases and reduce the accurate testing time of some applications by 50%-70%. Using the scenario data that EagleEye collects, Accurate Regression can generate links between test cases and application code in real time on large-scale applications (applications with 10 million+ links).
Tiantu Project: Tiantu depends on partial link data from EagleEye and provides users with Application Performance Management (APM) solutions targeting complex business links and highly distributed architectures. Tiantu allows you to quickly understand the general information about applications and business links in a comprehensive, real-time, visual and intelligent manner.
Last year’s Double 11 promotional event was a great success. The technical team won this “battle”, with EagleEye providing strong support for the teams at Alibaba. The system’s stability and real-time capability reached the expected standard during end-to-end stress testing and during the Double 11 Shopping Festival. It provided powerful business support and improved troubleshooting efficiency.
However, there is still a long way to go. As intelligent technologies develop ever faster, the business side places continually growing requirements on the quality of the data that EagleEye provides. In the future, EagleEye will focus on improving its architecture and advancing its intelligence to further improve troubleshooting efficiency and better support an ecosystem based on link data.