Performance Analysis of Alibaba Large-Scale Data Center

By Jianmei Guo, Senior Technical Expert

Data centers have become the standard infrastructure supporting large-scale Internet services. As data centers grow, every software upgrade (such as the JVM) or hardware upgrade (such as the CPU) comes at a high cost. Sound performance analysis helps optimize and upgrade the data center and save costs, while wrong analysis can mislead decision-making and even increase costs.

About the Author

Jianmei Guo (nickname: Xi Bo) is a senior technical expert at Alibaba, currently engaged mainly in data center performance analysis and combined software/hardware performance optimization. He is also a member of the CCF Technical Committee on System Software and the Technical Committee on Software Engineering.

This article is based on a talk Jianmei Guo gave at Java industry conferences. It introduces the challenges and practices of performance monitoring and analysis in Alibaba's large-scale data centers.

From Software to Hardware

Hello everyone, I am very glad to have the opportunity to communicate with developers in the Java community. My research field is software engineering, mainly focusing on system configuration and performance. A common activity in software engineering is finding bugs. Finding bugs is certainly important, but even a bug-free program can be misconfigured, causing software configuration problems.

Software often needs to be configured. For example, a Java program or the JVM can be configured with many parameters at startup. Through configuration, a single piece of software can flexibly provide various customized functions, and these configurations also affect the software's overall performance in different ways. That is still software configuration. After joining Alibaba, I had the opportunity to extend this work to hardware, focusing more on components such as the CPU, to see how system configuration changes and upgrades affect performance, reliability and service rollout. Today, I'll mainly talk about some of my work in this area.

The most representative event at Alibaba is Singles' Day, the world's biggest online shopping day. Here, I will use the data from 2017. The upper left corner shows the sales on Singles' Day: about US$25.3 billion in 2017, more than the combined sales of Thanksgiving, Black Friday and Cyber Monday over the same period in the United States.

This is data at the business level. Technicians may pay more attention to the data on the right. On Singles' Day in 2017, transactions peaked at 325,000 per second and payments peaked at 256,000 per second. What does such peak performance mean to an enterprise? It means cost! The reason for paying attention to performance is to continuously improve it and save costs through continuous technological innovation.

The peak performance at midnight on Singles' Day is not a simple number; it requires a large-scale data center to support it. In short, the top layer of Alibaba's infrastructure is a variety of applications, such as Taobao, Tmall, Cainiao, DingTalk, Alibaba Cloud and Alipay. This is also a feature of Alibaba: it has rich business scenarios.

The bottom layer is a large-scale data center with millions of connected machines. These machines have different hardware architectures and different distribution locations (they are even distributed all over the world). The middle layer is what we call the middle ground. The closest to the upper-layer applications are databases, storage, middleware, and computing platforms, followed by resource scheduling, cluster management, and containers, and then the system software, including the operating system, JVM, and virtualization.

The products in the middle ground are the link between communities and enterprises. Over the past two years, Alibaba has open-sourced many products, such as Dubbo and PouchContainer. It can be seen that Alibaba attaches great importance to the open source community and to dialogue with developers. Nowadays, many people talk about open source communities and ecosystems, and various forums have appeared. However, few activities talk directly to developers like this, and the advancement of a community ultimately depends on its developers.

Such a large-scale infrastructure serves the entire Alibaba economy. At the business level, we can see indicators such as sales of US$25.3 billion and 325,000 transactions per second. However, the way these business indicators break down across the parts of the infrastructure is very complicated. For example, when we develop a piece of Java middleware or a JVM feature, we run performance evaluations.

Most technical teams have performance-improvement indicators after developing a product, such as a 20% reduction in CPU utilization. However, what share of an individual product's improvement shows up in the entire transaction link and the entire data center? How much does it contribute to the overall performance improvement of the data center? This problem is very complex and covers a wide range of areas, including complex, interrelated software architectures and various kinds of heterogeneous hardware. Some of our thoughts and work in this area will be mentioned later.

Alibaba's e-commerce applications are mainly developed in Java, and we have also developed our own AJDK, which customizes OpenJDK extensively: incorporating new technologies, adding patches as needed, and providing better troubleshooting services and tools.

As we all know, in 2018 Alibaba was elected to the JCP EC (Java Community Process — Executive Committee) for a term of two years, which is a great event for the entire Java developer community, especially for the domestic Java ecosystem. However, not everyone understands the impact of this event. I remember a colleague once said that JCP EC membership is helpful to a company with a large business volume like Alibaba, but meaningless to a small company.

In fact, this is not the case. In JCP EC elections, large companies, small companies and individual community developers are all eligible to vote. Whether a big company, a small company or an individual developer, each has one vote and the same status. Many small foreign companies are eager to participate in community activities. Why?

For example, suppose business needs led you to develop a feature on JVM 8. It took a lot of effort to develop and debug, and the business finally launched successfully. However, the community then recommends upgrading to JVM 11. You may need to re-develop and debug the feature on JVM 11, and you may hit more traps, which obviously increases development cost and prolongs the launch cycle. But what if you could influence community standards? You could propose integrating the feature into the community's next release. This gives you the opportunity to make your development work a community standard, and to improve the feature with the community's help, which not only increases your technical influence but also reduces development cost. That is very meaningful.

In the past, we mainly relied on small-scale benchmark tests for performance analysis. For example, after developing a new JVM feature, everyone might run the SPECjbb2015 benchmark to simulate e-commerce scenarios. As another example, to evaluate new hardware, you would compare benchmark metrics from SPEC or LINPACK.

These benchmark tests are necessary, because we need a simple, reproducible way to measure performance. However, they also have limitations: each benchmark runs in a limited operating environment with particular software and hardware configurations, and those configurations may greatly affect performance. Whether they match enterprise needs, and whether they are representative, are questions that need to be considered.

Alibaba's data centers run tens of thousands of different business applications on millions of different servers distributed around the world. When we consider upgrading software or hardware in a data center, a key issue is whether the effect observed in a small-scale benchmark test extends to the complex online production environment of the data center.

For example, suppose we develop a new JVM feature and see good performance gains in the SPECjbb2015 benchmark. However, in a grayscale test in the online production environment, we find that the feature improves the performance of one Java application but degrades that of another. We may also find that even for the same Java application, the performance gains differ markedly across hardware. These situations are common, but we cannot test every application on every kind of hardware. Therefore, a systematic method is required to estimate the overall performance impact of the feature across applications and hardware.

It is important for a data center to evaluate the overall performance impact of each software or hardware upgrade. For example, at the business level, we may be mainly concerned with two metrics, the sales and transaction peak on “Singles’ Day”. Then, how many new machines do we need to buy when these two metrics double? Do we need to buy twice as many machines? This is a means to measure the improvement of technological capability, and a way to reflect the impact of new technologies on new businesses. We have proposed many technological innovations and discovered many opportunities for performance improvement, but we need to be able to see them from the business perspective.

To solve the problems mentioned above, we developed the SPEED platform. The first step is to estimate what is currently happening online (Estimation). The platform collects data through global monitoring and then analyzes it to find possible optimization points. For example, if certain hardware shows relatively poor overall performance, replacement can be considered.

Then, we make online evaluations of software or hardware upgrades and rebuilds (Evaluation). For example, when a hardware vendor launches new hardware, it runs a battery of evaluations and obtains a set of fairly good performance numbers. But as mentioned earlier, those evaluations are run under specific scenarios. Are those scenarios suitable for the specific needs of the user?

There is no direct answer. Generally, users do not allow hardware vendors to run evaluations in their business environments, so users need to run a grayscale test with the new hardware themselves. Of course, the larger the grayscale, the more accurate the evaluation; but the online environment is directly tied to the business, so to reduce risk the actual grayscale usually ranges from a handful or a few dozen machines up to hundreds or thousands. One problem the SPEED platform solves is making a good estimate even when the grayscale is very small, which saves a lot of cost.

As the grayscale increases, the platform continuously improves the performance analysis quality, thus assisting users in decision-making (that is, Decision). The decision here is not only to determine whether to upgrade new hardware or new software, but also to have a good understanding of the full-stack performance of software and hardware, and to understand what kind of hardware and software architecture is more suitable for the target application scenario, so that the direction of software and hardware optimization and customization can be considered.

For example, the architecture of Intel CPUs changed a lot from Broadwell to Skylake. What is the direct effect of this change? Intel can only give answers based on benchmark tests, but users can give their own answers based on their application scenarios, and thus put forward customized requirements, which has a significant impact on cost.

Finally comes Validation: verifying, through the effect of the large-scale rollout, whether the above methods are reasonable, and improving the methods and the platform at the same time.

Performance analysis of software and hardware upgrades in a data center requires a global performance metric, but no unified standard exists at present. Google published a paper at ASPLOS this year proposing a performance metric called WSMeter, which measures performance mainly based on CPI.

In the SPEED platform, we also propose a global performance metric, called RUE (Resource Usage Effectiveness). The basic idea is simple: measure the resources consumed per unit of work done. Here, work done can be a query completed in e-commerce or a task in big data processing. The resources cover four main categories: CPU, memory, storage and network. Usually we focus on CPU and memory, because these two components currently account for most of the server cost.

The idea of RUE provides a way to measure performance comprehensively, from multiple perspectives. For example, suppose the business side reports that an application's response time on a machine has increased, and logging on to the machine shows that load and CPU utilization have also risen. At this point you may get nervous and worry about a failure, perhaps caused by a newly launched feature.

However, you should first check the QPS metric. If QPS has also increased, the machine is simply using more resources to complete more work, and the new feature may even have improved resource usage efficiency. Performance therefore needs to be measured comprehensively from multiple perspectives; otherwise evaluations may be unreasonable and real opportunities for performance optimization may be missed.
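To make this concrete, here is a minimal sketch with made-up numbers (the `rue` function and all figures are illustrative, not taken from the SPEED platform):

```python
def rue(resources_consumed: float, work_done: float) -> float:
    """Resource Usage Effectiveness: resources consumed per unit of
    work done. Lower is better."""
    return resources_consumed / work_done

# Hypothetical figures before the new feature: 40 CPU-seconds consumed
# per wall-clock second while serving 1,000 queries per second.
before = rue(resources_consumed=40.0, work_done=1000.0)

# After the feature: CPU consumption rose to 48 CPU-seconds, but the
# machine now serves 1,400 queries per second.
after = rue(resources_consumed=48.0, work_done=1400.0)

# CPU utilization went up, yet each query now costs fewer resources:
# the feature improved resource usage efficiency rather than degrading it.
print(f"RUE before: {before:.4f}, RUE after: {after:.4f}")
```

Looking only at CPU utilization would have flagged this machine as a regression; dividing by work done reverses the conclusion.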

The following are several challenges in data center performance analysis, which are basically specific problems encountered online. I hope this can spark some thoughts.

The first challenge is the performance metric itself. Many people may say they are familiar with performance metrics and use them every day. In fact, truly understanding performance metrics and system performance is not that easy. For example, one of the most commonly used metrics in a data center is CPU utilization. Suppose the average CPU utilization of each machine in the data center is 50%. If application demand does not grow and the applications do not interfere with each other, can the number of machines in the data center be halved?

Ideally, resources are fully utilized when CPU utilization reaches 100%. But can CPU utilization and data center performance be understood so simply? The answer is no. As mentioned earlier, a data center has not only CPU but also memory, storage and network resources; with half the machines, many applications may fail to run at all.

For another example, a technical team upgraded the software it was responsible for, saw a 10% drop in average CPU utilization in online testing, and claimed a 10% performance improvement. The claim may hold, but what we care about more is whether the improvement saves costs: if performance improves by 10%, can we turn off 10% of the machines running this application? To answer that, we must look not only at CPU utilization but also at the impact on throughput.
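One hedged way to frame that check, as a sketch with invented capacity figures:

```python
import math

def machines_needed(peak_qps: int, qps_per_machine: int) -> int:
    """Machines required to serve a peak load at a given per-machine
    sustainable throughput."""
    return math.ceil(peak_qps / qps_per_machine)

# Hypothetical baseline: a 100,000 QPS peak served at 1,000 QPS/machine.
baseline = machines_needed(100_000, 1_000)

# Case 1: the 10% CPU drop translates into ~10% more sustainable
# throughput per machine -> some machines can actually be turned off.
after_real_gain = machines_needed(100_000, 1_100)

# Case 2: CPU utilization dropped but per-machine throughput did not
# change (e.g. the bottleneck moved to memory) -> no machines saved.
after_no_gain = machines_needed(100_000, 1_000)

print(baseline, after_real_gain, after_no_gain)
```

The cost question is settled by throughput per machine, not by the CPU utilization number alone.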

Therefore, you may be familiar with the system performance and various performance metrics, but a more comprehensive understanding is needed.

As mentioned earlier, the Estimation stage of SPEED collects online performance data. But is the collected data correct? Here is an example involving Hyper-Threading, Intel's simultaneous multithreading technology, which may be familiar to those who know hardware. Notebook computers are usually dual-core now, that is, they have two physical cores. If Hyper-Threading is supported and enabled, each physical core presents two hardware threads, so a dual-core machine exposes four logical CPUs.

The figure on the top shows two physical cores with Hyper-Threading disabled; the CPU resources of both cores are fully utilized, so the task manager reports 100% average CPU utilization for the machine. The figure in the lower-left corner shows two physical cores with Hyper-Threading enabled; one hardware thread on each physical core is fully utilized, and the reported machine average is 50%.

The figure in the lower-right corner also shows two physical cores with Hyper-Threading enabled; here both hardware threads of one physical core are fully utilized while the other core is idle, and the reported machine average is again 50%. The actual CPU usage in the lower-left and lower-right cases is completely different, but if we only collect the machine-level average CPU utilization, the data we see are identical!
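The two 50% cases can be sketched as follows; the utilization numbers are illustrative stand-ins for the figures described above:

```python
# Busy fraction per logical CPU: cpus[core][thread], 1.0 = fully busy.
# Lower-left figure: one hardware thread busy on EACH physical core.
lower_left = [[1.0, 0.0], [1.0, 0.0]]
# Lower-right figure: both hardware threads busy on ONE physical core.
lower_right = [[1.0, 1.0], [0.0, 0.0]]

def machine_average(cpus):
    """Whole-machine average utilization, as a task manager reports it."""
    threads = [u for core in cpus for u in core]
    return sum(threads) / len(threads)

def busy_physical_cores(cpus):
    """Physical cores with at least one busy hardware thread."""
    return sum(1 for core in cpus if any(u > 0.0 for u in core))

# The machine-level averages are identical...
print(machine_average(lower_left), machine_average(lower_right))
# ...but the physical-core pictures are completely different.
print(busy_physical_cores(lower_left), busy_physical_cores(lower_right))
```

Collecting only the machine-level average discards exactly the per-core information that distinguishes the two scenarios.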

Therefore, when performing performance data analysis, you should not only consider data processing and computing, but also pay attention to how the data is collected, otherwise you may get misleading results.

Hardware heterogeneity in the data center is a major challenge for performance analysis and a direction for performance optimization. For example, the Broadwell architecture on the left was Intel's mainstream server CPU architecture for the past few years. In recent years, the Skylake architecture on the right has been introduced, including the latest Cascade Lake CPUs. Intel made great changes between these two architectures: for example, Broadwell still used the long-standing ring interconnect for memory access, while Skylake switched to a mesh interconnect.

For another example, the L2 cache is four times larger in Skylake, which generally improves the L2 hit rate. However, a larger cache does not necessarily mean better performance, because maintaining cache coherence brings additional overhead. These changes have both advantages and disadvantages; we need to measure the impact of both on overall performance, and weigh cost in deciding whether to upgrade all servers in the data center to Skylake.

It is necessary to understand the differences in hardware, because these differences may affect all applications running on them and become the direction of hardware optimization and customization.

The software architecture of modern Internet services, such as Alibaba's e-commerce architecture, is very complex, and this complexity is itself a major challenge for performance analysis. In the figure above, the coupon application is on the right, the main promotion venue application is in the upper left corner, and the shopping cart application is in the lower left corner. These are three common e-commerce business scenarios, and from a Java development perspective each is an application. A customer can claim coupons from the main promotion venue or from the shopping cart, depending on their habits.

From the perspective of software architecture, the main promotion venue and the shopping cart form two portals into the coupon application. Different portals imply different calling paths into the coupon application, and thus different performance impacts.

Therefore, when analyzing the overall performance of the coupon application, it is necessary to consider the various intricate architectural associations and calling paths in the e-commerce business. It is difficult to completely reproduce such complex and diverse business scenarios and calling paths in benchmark tests, which is why we need to perform online performance evaluation.

This brings us to the famous Simpson's Paradox in data analysis. Many cases exist in sociology and medicine, and we have also found some in data center performance analysis. This is a real online case; we do not need to delve into what the app actually is.

Continuing the previous example, suppose the app is the coupon application, a new feature S is launched during a big promotion, and 1% of machines are in the grayscale test. Measured by the RUE metric, the feature improves overall performance by 8%, which is quite a good result. However, suppose the coupon application has three different groups, corresponding to the different portals just mentioned. Then, viewed group by group, the feature actually degrades the application's performance in every group.

With the same data and the same performance metric, the result of the overall aggregated analysis is exactly the opposite of the results of analyzing each part separately. This is Simpson's Paradox. Being a paradox, it means that whether to trust the overall result or the per-part results must be decided case by case. In this example, we chose the separate results, that is, the evaluation on each group; accordingly, the new feature appears to cause performance degradation and should be modified and optimized.
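The paradox can be reproduced with a small numerical sketch. All work and resource figures below are invented for illustration; RUE is resources per unit of work, so lower is better:

```python
# (work_done, resources_consumed) per group: baseline fleet vs. the
# small grayscale fleet running new feature S. Numbers are hypothetical.
baseline  = {"A": (100, 100.0), "B": (100, 200.0), "C": (100, 300.0)}
grayscale = {"A": (90,   99.0), "B": (5,    11.0), "C": (5,    16.5)}

def rue(work: float, resources: float) -> float:
    return resources / work  # lower is better

def aggregate_rue(fleet) -> float:
    total_work = sum(work for work, _ in fleet.values())
    total_res = sum(res for _, res in fleet.values())
    return total_res / total_work

# In EVERY group, the feature makes RUE worse (10% higher)...
for g in baseline:
    assert rue(*grayscale[g]) > rue(*baseline[g])

# ...yet the aggregate RUE of the grayscale fleet looks much better,
# because its machines are concentrated in "cheap" group A.
print(aggregate_rue(baseline), aggregate_rue(grayscale))
```

The aggregate comparison is dominated by how grayscale machines are distributed across groups, not by the feature itself, which is why the per-group view was the right one to trust here.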

Therefore, for the performance analysis in a data center, it is also necessary to prevent various possible data analysis traps. Otherwise, the decision-making may be seriously misled.

Finally, let me take a few minutes to describe what is required of performance analysts. In general, a performance analyst needs knowledge of mathematics and statistics, knowledge of computer science and programming, and, most importantly, domain knowledge accumulated over the long term, including an understanding of software, hardware, and full-stack performance.

In fact, I think every developer can think about this: we should not only develop features, but also consider their performance impact, especially their overall impact on the data center. For example, in JVM GC development, the community cares most about GC pause time; but we can also consider how that metric relates to the response time of Java applications, and to the CPU resources consumed.

