How Alibaba Won the State Technological Invention Award in China
By Sun Hongliang, nicknamed Hongliang, and Ding Haiyang, nicknamed Linshi.
At the start of the year, the State Science and Technology Prizes Conference was held at the Great Hall of the People in Beijing. During the conference, Alibaba Cloud won two national awards: the State Technological Invention Award and the State Science and Technology Advancement Award. This is the first time that an Internet company has won two national science and technology awards in China. It also marks a breakthrough for Internet companies in the State Technological Invention Award, and it shows that Alibaba’s ten-year technical strategy is truly coming to fruition.
Behind these awards lies a great deal of painstaking research. To win the second prize of the State Technological Invention Award, Alibaba Cloud worked with Professor Guo Minyi and his team at Shanghai Jiao Tong University on the research project “Key Technologies and Systems for Cloud Computing Facing Sudden Peak Services.” The technologies behind this research originated from ultra-large-scale Internet e-commerce scenarios at Alibaba Cloud.
This all started a decade ago, on November 11, 2009, when Alibaba launched the first Double Eleven Shopping Festival.
Since then, continually finding ways to support new consumption modes in ever-evolving industries characterized by “sudden peaks” has become something of a penchant for Alibaba.
Today, ten years later, the “sudden peaks” scenario has become an increasingly common problem across the industry, and even a problem connected to the livelihood of average Chinese people. “Sudden peaks” are a factor in large-scale scenarios like red packet snatching during Chinese New Year, and in the Chinese New Year Gala and Chinese New Year’s Eve events that bring together people around the country. Through more than ten years of strategic technological investment, and more than ten years of maturing cloud computing technologies, Alibaba Cloud has successfully solved this challenge. By popularizing cloud computing and related technologies, Alibaba has delivered these capabilities to many different industries in China and elsewhere.
The Challenge of Sudden Peaks
Before going any further, let’s look at what it means for cloud computing technology to be built for sudden peaks.
“Sudden peak services” refer to Internet services with a significant increase in the number of requests by end users from one unit of time to the next. When traffic surges, Internet services are prone to slow user request responses and system crashes. For example:
- In 2014, an e-commerce website responded slowly and then crashed during the 618 shopping carnival.
- On Chinese New Year’s Eve in 2015, a social media-based red envelope app became unresponsive, and transactions failed to be processed in the app.
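The definition above, a significant increase in requests from one unit of time to the next, can be made concrete with a small sketch. The function name, the traffic numbers, and the growth factor of 3 are all assumptions chosen for illustration; the article does not state a specific threshold.

```python
from typing import List


def find_sudden_peaks(requests_per_interval: List[int],
                      growth_factor: float = 3.0) -> List[int]:
    """Return the indices of intervals whose request count is at least
    growth_factor times the count of the preceding interval.

    growth_factor is a hypothetical threshold, not a value from the article.
    """
    peaks = []
    for i in range(1, len(requests_per_interval)):
        previous = requests_per_interval[i - 1]
        current = requests_per_interval[i]
        # Skip intervals that follow zero traffic to avoid a meaningless ratio.
        if previous > 0 and current >= growth_factor * previous:
            peaks.append(i)
    return peaks


# Made-up request counts per interval: traffic jumps sharply at index 3.
traffic = [1000, 1100, 1050, 9000, 9500, 1200]
print(find_sudden_peaks(traffic))
```

Only the interval where traffic jumps is flagged. The following interval stays high but does not grow relative to its predecessor, which matches the "from one unit of time to the next" wording.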
Although elasticity and almost unlimited resources were among the selling points of cloud computing when the concept was first introduced, traditional cloud computing technologies are largely designed for general elastic computing scenarios, not ultra-scale traffic peaks. As a result, the following problems commonly occur when processing sudden traffic spikes:
- High cost: Extra capacity is required to meet peak demands, with no cost savings.
- Long latency: Nodes with low computing power in the cloud carry high loads, and scheduling is uneven.
- Low throughput: Failures surge during storage device expansion, which leads to slow responsiveness.
- Slow expansion: The image repository network is congested, which results in tardy distribution.
- Difficult O&M: Expert experience evolution and querying are slow, and analysis is not intelligent.
Through the efforts of numerous engineers and researchers, Alibaba Cloud has solved these issues and created a high-efficiency resource integration technology based on a system of containers and hybrid deployment.
Starting Line for Technical Exploration
Alibaba’s container and co-location technologies are the result of long-term exploration of the sudden peak problem, which can only be resolved by combining several technologies. Every problem becomes difficult when it involves such high traffic spikes. The starting point for these technological breakthroughs can be traced back to the early days of “Double Eleven”.
In 2011, the development of virtualization technology across the industry was in full swing. Virtualization technology, as represented by KVM, Xen, and VMware, was almost a general trend in unified infrastructure.
However, a small team at Alibaba, led by Duolong and Bixuan, left the common path and decided to explore a project code-named “t4”, which evolved into Taobao’s fourth-generation computing engine. The technological concept of t4 is exactly the same as that of containers, the core technology of today’s cloud-native world.
Whenever Alibaba engineers talk about this stage of history, they mention two things: “scaling efficiency” and “resource utilization rate.” The Double Eleven Shopping Festival put huge traffic pressure on computing services. In the face of access peaks, providing real-time “elastic and scalable” back-end systems was one technological challenge. In the face of traffic pressure dozens of times higher than on ordinary days, improving the resource utilization rate of data centers, so that resources did not grow linearly with the traffic, was another.
It is impractical to purchase a hundred times as many servers to handle a hundred times the traffic pressure. Identifying this fundamental problem enabled Alibaba to take its first steps in developing container technology and to work on the actual challenges, which paved the way for future research and development on the underlying problems.
Containerization, Microservices, and Co-location
While container technology was quietly advancing, the peak system load that Double 11 faces grew every year, and the cost of cloud computing infrastructure increased exponentially with it. At the same time, because the average daily resource utilization of regular business is only about 10%, purchasing additional servers can be very wasteful.
Therefore, in 2014, Alibaba began to explore the co-location technology by hosting the loads for both online services and offline big data computing in shared clusters, with a view to significantly improving the resource utilization of data centers. Both types of loads, online services and offline big data computing, have multiple complementary features, which make large-scale co-location feasible for data centers:
- In data centers, the resource usage of online services is generally low and service peaks are short. During large promotions like Double 11, however, service peaks are sustained throughout the peak hours, with the heaviest pressure typically falling during the day. Online services are sensitive to high latency and jitter.
- Offline services are the opposite. In normal cases, their resource usage is relatively high and relatively constant, and their resource pressure peaks often occur at night. These workloads require a high level of throughput but are not sensitive to latency.
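The complementary pattern in the two points above can be illustrated with a small capacity calculation. This is a minimal sketch with made-up hourly load figures (in CPU cores) and hypothetical function names. It deliberately ignores the resource contention that co-location introduces, so it shows only the potential gain.

```python
from typing import List


def capacity_separate(online: List[int], offline: List[int]) -> int:
    """Cores needed when each workload runs in its own cluster:
    each cluster must be sized for its own peak."""
    return max(online) + max(offline)


def capacity_shared(online: List[int], offline: List[int]) -> int:
    """Cores needed when both workloads share one cluster:
    the cluster is sized for the peak of the combined load."""
    return max(o + f for o, f in zip(online, offline))


def average_utilization(online: List[int], offline: List[int],
                        capacity: int) -> float:
    """Fraction of the provisioned capacity that is used on average."""
    total_load = sum(online) + sum(offline)
    return total_load / (capacity * len(online))


# Hypothetical load per time slot: online peaks by day, offline by night.
online_load = [20, 20, 80, 90, 80, 30]
offline_load = [90, 80, 30, 20, 30, 85]

separate = capacity_separate(online_load, offline_load)
shared = capacity_shared(online_load, offline_load)
print("separate clusters:", separate, "cores,",
      round(average_utilization(online_load, offline_load, separate), 2))
print("shared cluster:   ", shared, "cores,",
      round(average_utilization(online_load, offline_load, shared), 2))
```

Because the two peaks fall at different times, the shared cluster needs far less capacity than the sum of the two individual peaks, and average utilization rises accordingly.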
Intuitively, then, running both types of loads in a shared cluster can improve resource utilization. However, the real concern is not resource utilization itself but whether all loads can run normally. When cluster resource utilization is high, online and offline services compete for resources, including CPU, memory, network bandwidth, and I/O. For microservices with ultra-long and complex links running under ultra-high cluster resource pressure, a delayed response from any computing resource can compromise the user experience of the entire business. Therefore, ensuring low latency and stability in co-location scenarios became a great challenge for Alibaba.
To address this challenge thoroughly and systematically, Alibaba spent three years on it, starting in 2016. Building on the early microservices of its business operations, Alibaba reconstructed its data centers, fully migrated its businesses to the cloud, and containerized all of the group’s business operations. The result was a centralized co-location scheduling resource pool based on containers and microservices, which supported the further evolution and upgrading of Alibaba’s overall infrastructure.
In macro terms of resource management, the online and offline service schedulers work collaboratively, and a sound technical architecture is established at the scheduling layer for the isolation between the co-located loads. With both schedulers performing their respective duties, the instantaneous needs of online services can be ensured to the maximum extent possible. Resources can be obtained promptly when needed. Moreover, the long-term requirements can be ensured for offline services. That is, sufficient resources can be made available to complete computing tasks within a given time window.
To reduce resource fragmentation during large-scale deployments, engineers upgraded the smart scheduling algorithm to solve such high-dimensional knapsack problems, greatly improving server resource deployment density, and allowing the same number of servers to run more applications at the same time, while also lowering costs.
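The article does not describe the scheduling algorithm itself. To show what a multi-dimensional packing problem of this kind looks like, here is a minimal sketch that uses a plain first-fit-decreasing heuristic, a textbook technique swapped in for illustration and not Alibaba's method. The server capacity, application names, and resource demands are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Resources = Tuple[float, float]  # (CPU cores, memory in GiB)


@dataclass
class Server:
    capacity: Resources
    used: List[float] = field(default_factory=lambda: [0.0, 0.0])
    apps: List[str] = field(default_factory=list)

    def fits(self, demand: Resources) -> bool:
        # An app fits only if every resource dimension stays within capacity.
        return all(u + d <= c
                   for u, d, c in zip(self.used, demand, self.capacity))

    def place(self, name: str, demand: Resources) -> None:
        for i, d in enumerate(demand):
            self.used[i] += d
        self.apps.append(name)


def first_fit_decreasing(apps: Dict[str, Resources],
                         capacity: Resources) -> List[Server]:
    """Pack apps onto as few servers as this simple heuristic manages."""
    servers: List[Server] = []
    # Place the most demanding apps first, ranked by their largest
    # demand relative to server capacity.
    ordered = sorted(
        apps.items(),
        key=lambda item: max(d / c for d, c in zip(item[1], capacity)),
        reverse=True,
    )
    for name, demand in ordered:
        target = next((s for s in servers if s.fits(demand)), None)
        if target is None:
            target = Server(capacity)
            if not target.fits(demand):
                raise ValueError(f"{name} does not fit on an empty server")
            servers.append(target)
        target.place(name, demand)
    return servers


# Hypothetical demands on hypothetical 16-core, 64 GiB servers.
demands = {"a": (8, 16), "b": (8, 48), "c": (4, 32), "d": (4, 8), "e": (12, 16)}
for index, server in enumerate(first_fit_decreasing(demands, (16, 64))):
    print(index, server.apps, server.used)
```

Each added resource dimension (network, disk I/O, and so on) makes tight packing harder, which is why the article describes the problem as high-dimensional. A production scheduler would also have to respect constraints this sketch ignores, such as isolation between co-located loads.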
In micro terms of resource isolation, Alibaba engineers dug deep into the operating system and kernel, and worked on various aspects that target different business characteristics. In terms of CPU resource allocation, engineers optimized the Linux CFS scheduling algorithm and designed priorities for various businesses with different importance in terms of queuing time and eviction levels. This ensured that important tasks can quickly obtain CPU resources when needed, reducing service response time and improving service stability.
In a multi-core processor environment, Noise Clean and other technologies are used to ensure that high-priority tasks can make full use of the microresources of their CPU cores, or hyper threads, further ensuring business stability. CAT, NUMA, JVM cold memory recycling, and other technologies are also exceptionally useful in this case.
In terms of the efficiency of ultra-large scale deployment, Alibaba Cloud’s Dragonfly provides efficient image distribution technologies, accelerating image distribution time exponentially. This decreases the completion time for image distribution tasks from several minutes to only a few seconds. Dragonfly is now one of the official projects approved by the Cloud Native Computing Foundation.
These measures also laid a solid foundation for Alibaba Group’s full cloud migration and transition to cloud native. This complete resource isolation and elastic sharing mechanism enables Alibaba’s large-scale data centers to stably support the sudden peaks of online businesses while improving resource utilization, and ultimately achieve zero incremental costs during large promotion campaigns.
In fact, the revolution in containerization, microservices, and co-location technologies that originated in dealing with the “sudden peak” problem has brought even greater economic value to daily operations outside of the large Double 11 promotion.
For example, through co-location technology, the average daily resource utilization of Alibaba data centers has increased from 10% to 40%, reducing costs by billions each year, which is a historic breakthrough.
Moreover, in 2019, Alibaba Group migrated its services to the cloud through co-location clusters based on the inherent elasticity of Alibaba Cloud and the powerful and stable performance of its in-house X-Dragon server.
This migration upgraded the architecture to a new generation of cloud native co-location technology based on secure containers. It also adopts innovative methods such as data intelligence-driven O&M, scheduling, and management to implement automatic and intelligent management of cloud native co-location clusters, greatly improving the service stability of co-location clusters and further reducing costs.
Outside Alibaba: From Internal Use to General Adoption
Alibaba has gradually pushed the core technologies the group has accumulated over the years into the community. One of the methods it uses is open sourcing, with projects such as Pouch, an efficient and lightweight enterprise-class rich container engine, and Dragonfly, a large-scale, low-latency P2P distribution system. Another method is through Alibaba Cloud, which allows enterprises to easily use some of the world’s leading enterprise-level container technologies and enjoy the technical benefits brought by the cloud era. According to Forrester’s third-party report, Alibaba Cloud easily ranks first in China in terms of container service capabilities.
Alibaba Cloud’s ability to handle sudden peaks is no exception. So how does Alibaba Cloud use container technology to help the industry? The following sections describe solutions to common peak problems at Alibaba Cloud.
Off-Premises E-commerce: Stopping Sniping
E-commerce scenarios often feature sudden peaks over a short period of time, and promotion campaigns are relatively frequent.
Traditionally, in e-commerce scenarios, the underlying computing systems scale inefficiently, so the resources needed for scale-out are usually provisioned in advance and kept ready for a long time, resulting in high costs. The resources for an activity must also be provisioned according to the duration of the activity. In addition, more frequent activities incur a higher O&M workload, and more manual operations result in more errors, longer scaling times, and greater difficulty of implementation.
However, by containerizing business systems, the business and running environment are integrated into container images, greatly simplifying deployment. Additionally, the orchestration application Kubernetes ensures the high reliability of service self-recovery from failures and auto scaling. The full lifecycle management of container services is also taken over by Kubernetes, which simplifies the O&M work.
With ESS integration, clusters can auto-scale worker nodes. To do this, auto-scaling rules are configured in advance, and once the worker nodes are ready, the containerized business is rapidly scaled out and scheduled onto the available worker nodes. In the meantime, the network is automatically configured, and Kubernetes ensures the high availability of applications. When the business peak has passed, Kubernetes automatically scales in the business based on the horizontal pod autoscaler (HPA) capability. After the business scales in, the cluster auto-scaling rule is applied to scale in and release the worker nodes.
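The scale-out and scale-in decisions made by the HPA follow a simple rule documented by Kubernetes: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The sketch below reproduces that core formula in simplified form. The function name, the replica bounds, and the example numbers are assumptions, and the real controller adds further behavior (stabilization windows, pod readiness handling) that is omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 100,
                     tolerance: float = 0.1) -> int:
    """Simplified form of the HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric).
    """
    ratio = current_metric / target_metric
    # Within the tolerance band (0.1 is the Kubernetes default), do nothing.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (hypothetical values here).
    return max(min_replicas, min(max_replicas, desired))


# Peak: 10 pods at 90% average CPU with a 50% target, so scale out.
print(desired_replicas(10, 90, 50))
# After the peak: 18 pods at 10% average CPU, so scale in.
print(desired_replicas(18, 10, 50))
```

Scaling pods in this way frees capacity on worker nodes, which is what then allows the cluster-level rule described above to release the nodes themselves.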
Gene Computing: Typical Computing Tasks Benefit from Better Elasticity and Hybrid Cloud Capabilities
Gene detection computing in the genetics industry is another typical computing scenario. Although it is not as challenging as the e-commerce scenarios mentioned above, it also benefits greatly from the elastic performance and cost advantages of containerization.
Currently, more than 70% of gene sequencing companies and research institutions in China use Alibaba Cloud services. In traditional practices, due to restrictions on offline resources, it could take several days or even a month or more to produce results. To improve computing efficiency, companies would need to maintain a large number of computing resources, but detection does not actually run throughout the day.
However, a hybrid cloud container elastic computing architecture can solve this problem. When offline data collection is completed, the hybrid cloud is used to transfer the data to the cloud. Through container-based elasticity, the cloud can automatically start hundreds of detection programs within minutes, and then quickly complete the computing operations.
When the detection is completed, resources are destroyed within minutes, which greatly reduces the cost of resource use and makes full use of cloud capabilities. In addition to cost reduction, by relying on the powerful computing capabilities of Alibaba Cloud, it takes only 15 minutes to complete high-precision individual genome sequencing, a process that on average takes the general scientific community 120 hours.
So far, containers have become the most popular means of development on the cloud and the optimal application running environment. Alibaba itself is a heavy container user. Drawing on its accumulated experience of serving a large number of customers in the public cloud, combined with elastic computing ECS, the X-Dragon server, ESS, various middleware, big data service capabilities, and other strengths, Alibaba Cloud has become a leading choice for migrating enterprise businesses to the cloud and for digital transformation across public, private, and hybrid cloud environments.
The Story for the 12306 App: Help Every Homecomer
The 12306 app is the application used to buy train tickets in China. Before a user attempts to purchase tickets, the number of remaining tickets must be checked, especially in special scenarios like Chinese New Year, when many Chinese return home to visit their families.
In this context, the 12306 app must provide a massive data query service, which is also the most important component of the system. In traditional IT solutions, a large number of hardware devices are required for the annual travel period of Chinese New Year. After the travel period of Chinese New Year, however, these devices will be idle, which results in a huge waste of resources.
Another serious problem with the traditional solution is that, if the peak traffic during the Chinese New Year travel period exceeds expectations, the service will be paralyzed. Restoring the service would then take at least a month or two, because large numbers of servers have to be purchased, mounted, deployed, and debugged, which would be far too late for handling the situation.
Therefore, migrating the business to the cloud and making full use of the elasticity provided by Alibaba Cloud can maximize the cost advantages and fully meet business needs. During the peak travel times, resource delivery for the remaining ticket query service can be completed in minutes in Alibaba Cloud.
In addition, resources can be released promptly after the traffic peak, greatly reducing costs and providing users with a smooth ticketing experience.
The preceding breakthroughs in cloud computing technology for sudden peak services were originally derived from Alibaba’s e-commerce nature and the hard work of Alibaba engineers in confronting these challenges.
They are also very much a result of the collaboration between Alibaba engineers and high-level teams from Chinese universities. We are proud that our technologies can help society solve increasingly important problems in the face of rapid development.
Here, we would like to quote Academician Wu Jiangxing’s evaluation of this project as a summary:
“‘Key Technologies and Systems for Cloud Computing Facing Sudden Peak Services’ reflects a unique model of technological innovation in China. The huge technical challenges of the ultra-large scale market have created many technologies that can be tested and refined only through experience and lessons learned from repeated iteration in large-scale applications. The process of quantitative change is bound to be accompanied by qualitative leaps. This period will undoubtedly give rise to motivation for scientific and technological innovation. In the past, people were biased against Internet companies and thought that they are incomparable to hardware vendors in high technological content. However, to be able to meet the needs of the largest user group in the world, it must be backed up by world-leading technological capabilities.”