10 Years of Double 11: The Evolution and Upgrade of Alibaba’s Cloudification Architecture
The annual Alibaba Double 11 Shopping Festival (Singles’ Day) celebrates its 10th anniversary this year. Over its 9-year history, transaction volume has grown 280-fold, peak transaction volume more than 800-fold, and the supporting systems have grown explosively. The complexity and difficulty of supporting the Double 11 Shopping Festival increase exponentially every year. During the event, user experience and cluster throughput must be optimized within limited costs, and traffic peaks handled at a reasonable price.
In this article, Ding Yu, Senior Staff Engineer of Alibaba Group, shares his experience in the evolution and upgrade of the cloudification architecture during nine years of the Double 11 Shopping Festival.
About the Speaker
Ding Yu is a senior technical expert at Alibaba Group. He has participated in preparations for the Double 11 Global Shopping Festival eight times and is responsible for the high-availability architecture of Alibaba Cloud, stable operation of the event, Alibaba Cloud Container Service, scheduling systems, cluster management systems, and O&M technologies.
9 Years of Double 11
Hello, everybody. I am Ding Yu, and I am glad to share Alibaba’s technical development through the Double 11 Shopping Festival. First, let’s consider a question: what challenges of the Double 11 Shopping Festival drive Alibaba’s technical advancement?
- Internet scale: Hundreds of millions of users conduct transactions on Alibaba websites.
- Enterprise-level complexity: Each transaction involves the service support from hundreds of systems.
- Financial stability: Each transaction must be complete and accurate.
- System stability: Systems must be stable enough to handle service peaks during the Double 11 Shopping Festival, which are dozens of times more than the normal service volume.
Scalability and stability problems are solved effectively, thanks to technical breakthroughs such as distributed architecture, remote multi-active architecture, throttling and downgrade, and comprehensive stress testing. The system architecture has evolved in multiple generations over the nine years’ development of the Double 11 Shopping Festival, with significant improvement every year. Alibaba has accumulated a large number of Internet-based middleware technologies since it set out to transform the centralized system architecture to a distributed and scalable architecture in 2008. In 2013, the remote multi-active architecture was developed to deploy Alibaba’s complete transaction units in various cities of China for regional scaling. The two technologies are combined to solve the scalability problem during the Double 11 Shopping Festival.
The evolution of the distributed architecture brought a series of problems in system stability, increased system complexity, and multi-system collaboration. We built throttling and downgrade, a contingency plan system, and an online management and control system, and in 2013 we developed comprehensive stress testing for the Double 11 Shopping Festival. Stress testing replays the large volumes of user-level read and write traffic expected during the event, processed according to the dependencies of the system. The processing capability of the online production environment is verified in simulated event scenarios so that problems can be located and solved in a timely manner. These technologies have become standard practice in the Internet industry.
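The article does not describe Alibaba’s throttling implementation; as a minimal sketch of the idea, a token-bucket limiter admits traffic up to a sustained rate plus a bounded burst, and requests beyond that are rejected or downgraded. The class name and parameters below are illustrative, not Alibaba’s actual API.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: refills `rate` tokens per second
    up to `capacity`; a request is admitted only if a token is available."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would downgrade or reject this request

bucket = TokenBucket(rate=100, capacity=10)
admitted = sum(bucket.allow() for _ in range(50))  # instantaneous burst of 50
# only about `capacity` requests are admitted immediately; the rest are shed
```

During a real peak, the rejected fraction would be routed to a degraded path (cached pages, queued orders) rather than dropped outright.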
Evolution of the Cloudification Architecture
After ensuring system reliability, we found a large amount of hardware, time, and labor costs consumed due to fast increasing peak values during the Double 11 Shopping Festival. The cost challenge drives us to solve the problem of IT costs, or server resources. First, let’s have a look at the evolutionary background of the cloudification architecture.
The preceding figure shows the peak values of Alibaba services over six months. The two highest peaks are the transactions of the 11.11 and 12.12 Global Shopping Festivals; the smaller peaks are ordinary days. The red line indicates the daily resource processing capability of the system’s servers.
Before 2013, we bought large numbers of servers to support peak traffic during the Double 11 Shopping Festival, and many resources were wasted as those servers ran inefficiently for a long time after the peak. This was a crude mode of budget and resource management. Alibaba’s diverse business models had produced a multitude of clusters, operated by different O&M systems and lacking interoperability and elasticity, so they contributed nothing during the Double 11 Shopping Festival. Each module’s resource pool had its own buffer pool, online rate, distribution ratio, and resource usage. We use the cloudification architecture to improve overall technical efficiency and the elastic multiplexing of global resources; for example, the resources of a cluster not involved in the Double 11 Shopping Festival can be contributed to the event’s transactions. Because the cloud provides the elasticity the Double 11 Shopping Festival requires, we rely heavily on Alibaba Cloud to solve the event’s cost problem. We streamline technical systems to reduce costs both during promotions and on ordinary days, and we set the objective of halving the per-transaction cost of the Double 11 Shopping Festival through the cloudification architecture.
Let’s review the status of the O&M system at the time. Clusters fall into three types: online service clusters, computing task clusters, and ECS clusters. The three types have independent resource management, basic O&M, and scheduling, each with its own scheduling and orchestration capability: Sigma scheduling for online services, Fuxi scheduling for computing tasks, and the cloud Open APIs for ECS. They differ in resource production, resource supply, and distribution mode: online services use containers, computing task scheduling produces lightweight LXC-based isolation containers, and the cloud produces ECS instances. They are managed by different O&M clusters at the application layer and run different workloads at the top service layer: online servers run online services such as transactions, search, and advertising, plus stateful storage; computing clusters run big data analysis tasks; and cloud clusters run external customers’ tasks.
During restructuring and upgrade, technologies are cloudified to build elastic multiplexing capabilities for unified global scheduling. Online tasks and computing tasks are deployed in co-location mode, and scheduling efficiency is improved through unified O&M deployment and standardized resource allocation, enabling automatic capacity delivery. We need to implement comprehensive containerization and leverage public cloud elasticity to reduce investment in self-purchased infrastructure. By multiplexing Alibaba Cloud’s capabilities through one-click site construction and an elastic hybrid cloud architecture, we minimize the cost of the Double 11 Shopping Festival and shorten the resource retention period from one year to one or two months on Alibaba Cloud.
Unified Scheduling System
Built in 2011, Sigma is the scheduling system for Alibaba’s online services. It sits at the core of a cluster management system.
Sigma consists of three cooperating components: Alikernel, SigmaSlave, and SigmaMaster. Deployed on each NC (physical node), Alikernel enhances the kernel, flexibly adjusts the allocation of resources and time slices based on priorities and policies, and makes independent decisions on task latency, time-slice preemption, and eviction of improper preemption through upper-layer rule configuration. SigmaSlave allocates CPU resources and handles emergencies on the local machine; its quick local decisions and responses for latency-sensitive tasks avoid the service losses that slow global decision-making would cause. SigmaMaster makes global decisions on resource scheduling, allocation, and algorithm optimization when containers are deployed across a large number of physical machines.
The entire architecture is oriented toward the final state: after requests are received, data is persisted, and schedulers identify scheduling requirements and allocate resources accordingly. The system features excellent coordination and eventual consistency. We developed the scheduling system in 2011, rewrote it in Go in 2016, and made it compatible with Kubernetes APIs in 2017, in the hope of growing with the open source community.
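A final-state-oriented architecture can be illustrated with a reconciliation loop, the same pattern Kubernetes controllers use: compare the desired state with the actual state and emit only the actions needed to converge. This is a generic sketch, not Sigma’s actual code; the container names and specs are hypothetical.

```python
def reconcile(desired, actual):
    """One pass of a final-state-oriented controller: compare the desired
    container set with the actual one and return the converging actions."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))       # missing container
        elif actual[name] != spec:
            actions.append(("update", name, spec))       # drifted container
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))       # orphaned container
    return actions

desired = {"trade-app": {"cpu": 4}, "search-app": {"cpu": 8}}
actual = {"trade-app": {"cpu": 2}, "legacy-app": {"cpu": 1}}
acts = reconcile(desired, actual)
# trade-app is updated, search-app created, legacy-app deleted
```

Because each pass is idempotent, the loop can simply be rerun after failures, which is what gives the system its coordination and eventual-consistency properties.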
Unified scheduling and centralized management unlock economies of scale. Various service models exist across the two layers of online service scheduling and computing task scheduling. Utilization and allocation rates are improved by merging resource pools, and global resources are streamlined through spatial optimization by merging buffers. With global coordination, elastic time-based reuse and time-dimension optimization reduce total resource usage by 5%, a significant effect given the large base.
Since 2014, Alibaba has been promoting the co-location architecture, which is now widely deployed on its internal systems. Online services have long life cycles, complex rules and policies, and latency sensitivity, whereas computing tasks have short life cycles, high concurrency, high throughput, various priorities, and latency insensitivity. To meet these different requirements, the co-location architecture runs both types of scheduling on the same server: the SigmaAgent calls the OCI-compatible container runtimes RunC, RunV, and RunLXC to start PouchContainer, while Fuxi preempts resources on the same server to execute its computing tasks. All online tasks run in PouchContainer containers, which are placed onto servers by scheduling; offline tasks then fill the blank areas, making full use of the physical machine’s resources. This completes the co-location of the two types of tasks.
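The placement described above can be sketched as a two-phase fill: latency-sensitive online tasks reserve capacity first, and offline tasks greedily pack into whatever remains. This is an illustrative simplification, not Sigma/Fuxi’s actual algorithm, and the numbers are hypothetical.

```python
def colocate(node_capacity, online_tasks, offline_tasks):
    """Two-phase co-location placement on one node: online tasks reserve
    capacity first; offline tasks fill the remaining blanks, largest first."""
    free = node_capacity - sum(online_tasks)  # capacity left after online tasks
    placed = []
    for demand in sorted(offline_tasks, reverse=True):
        if demand <= free:          # offline task fits into a blank area
            placed.append(demand)
            free -= demand
    return placed, free

# 96-core node: two online services take 36 cores, offline jobs fill the rest.
placed, leftover = colocate(96, online_tasks=[20, 16],
                            offline_tasks=[40, 30, 10, 8])
```

In practice the offline jobs would also be preemptible, so the blanks can be reclaimed instantly when online traffic rises.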
Key Technologies of Co-location
Key technologies in kernel resource isolation
- A noise-cleaning technology isolates CPU hyper-threading (HT) resources, fixing resource preemption between online and offline threads.
- A Task Preempt function is added to the CFS scheduler to isolate CPU scheduling resources, raising the priority of online tasks.
- For CPU cache isolation, Intel CAT partitions last-level cache (LLC) ways between online and offline tasks (Broadwell and later).
- Memory resources are isolated through cgroups and OOM priorities; bandwidth control reduces the quota of offline tasks, isolating memory bandwidth.
- Memory elasticity improves the co-location effect without extra memory: when online tasks are idle, offline tasks may exceed their memcg limit; when online tasks need memory, offline tasks release it immediately.
- For QoS, management and control tasks carry the gold priority, online tasks the silver priority, and offline tasks the copper priority, each with a different bandwidth level.
Key technologies of online cluster management
- Create profiles for applications covering memory, CPU, disk, and network I/O. Capture their characteristics, resource specifications, and real resource usage in different time segments, then analyze specifications and time segments together to optimize scheduling.
- Distribute resources based on affinity, mutual exclusion, and task priority. Affinity means that certain applications, when placed together, consume less computing resources and deliver higher throughput.
- Different policies apply in different scenarios. For 11.11, stability comes first: tasks are spread so that every node stays at the lowest possible utilization. In routine scenarios, utilization comes first: tasks are packed so that used nodes reach their maximum and whole blocks of resources are freed for large-scale tasks.
- Applications can be scaled down automatically and reuse resources based on time segments.
- The entire site can be scaled up and down quickly with the elastic memory technology.
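The spread-versus-pack policy choice in the list above can be expressed as a one-line difference in node selection. This sketch is illustrative; the node structure and policy names are assumptions, not Sigma’s real interface.

```python
def choose_node(nodes, demand, policy):
    """Pick a node for a task of size `demand` under two policies:
    'spread' (11.11, stability first) picks the least-utilized node;
    'pack' (routine, utilization first) picks the fullest node that fits."""
    candidates = [n for n in nodes if n["free"] >= demand]
    if not candidates:
        return None  # no node can host the task
    if policy == "spread":
        return max(candidates, key=lambda n: n["free"])   # most headroom
    return min(candidates, key=lambda n: n["free"])       # tightest fit

nodes = [{"name": "a", "free": 10}, {"name": "b", "free": 4},
         {"name": "c", "free": 7}]
# spread sends a 3-core task to node "a"; pack sends it to node "b"
```

Packing leaves nodes like "a" entirely free, which is exactly the “free a large block of resources for large-scale tasks” goal in routine scenarios.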
Key technologies of computing task scheduling and MaxCompute
Three key co-location technologies are used: elastic memory time-based reusing, dynamic memory oversubscription, and lossless/lossy downgrade. Dynamic memory oversubscription means extra memory can be allocated to offline tasks, but it must be released quickly when online tasks need it. Lossless/lossy downgrade means that when interference stays within the acceptable range, a lossless downgrade is performed: no new offline tasks are admitted. When interference exceeds the acceptable range, running tasks are killed directly, which is a lossy downgrade. A zero-layer management component on each co-located server controls the relationship between the online and offline tasks on it.
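The lossless/lossy decision can be sketched as a threshold check on observed online latency versus a baseline. The thresholds below are hypothetical illustrations (the 5% figure echoes the acceptable latency impact cited for co-location), not Alibaba’s published values.

```python
def downgrade_action(latency_ms, baseline_ms, tolerance=0.05):
    """Decide the downgrade mode from online latency interference:
    interference = relative slowdown versus the no-co-location baseline.
    Thresholds (tolerance and 2*tolerance) are illustrative assumptions."""
    interference = (latency_ms - baseline_ms) / baseline_ms
    if interference <= tolerance:
        return "none"        # co-location healthy, keep admitting offline work
    if interference <= 2 * tolerance:
        return "lossless"    # stop admitting new offline tasks, let old finish
    return "lossy"           # kill running offline tasks immediately
```

A zero-layer manager would run this check continuously per server, escalating from lossless to lossy only when throttling admission is no longer enough.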
Co-location introduces computing tasks into an online service cluster to improve daily resource utilization. After offline tasks are introduced, average CPU utilization rises from 10% to more than 40%, while the latency of sensitive applications increases by less than 5%, which is acceptable. Currently, the co-location cluster contains thousands of servers and has been verified on the core transaction link during the Double 11 Shopping Festival. This optimization saves 30% of servers in daily operation. Because it involves iteration of hardware and network, preparation takes a long time; Alibaba will scale the deployment tenfold next year.
Time-based reusing further improves resource utilization. The above figure shows the traffic curve of an application; the curve is regular, with the trough period at night on the left and the peak period in the daytime on the right. With co-location, the resources in the blue shaded area are occupied, raising utilization to 40%. Elastic time-based reusing finds an application’s traffic trough from its profile and scales the application down, releasing memory and CPU so that more computing tasks can be scheduled. This technology raises average CPU utilization above 60%.
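Finding the trough from an application profile can be sketched as a simple threshold over a daily traffic curve. The 24-hour profile and the 30%-of-peak threshold below are made-up illustrations, not real Alibaba data.

```python
def trough_hours(hourly_qps, threshold_ratio=0.3):
    """Return the hours of the day whose traffic is below `threshold_ratio`
    of the daily peak: candidates for scaling the online application down
    and lending its cores and memory to computing tasks."""
    peak = max(hourly_qps)
    return [h for h, qps in enumerate(hourly_qps)
            if qps < threshold_ratio * peak]

# Hypothetical daily profile: low at night, peaking midday.
profile = [10, 8, 5, 4, 4, 6, 20, 50, 80, 100, 95, 90,
           85, 80, 75, 70, 65, 60, 55, 50, 40, 30, 20, 15]
windows = trough_hours(profile)  # overnight hours plus the late-evening tail
```

A scheduler would shrink the application’s replica count during these windows and restore it before traffic climbs back, matching the blue-shaded reuse in the figure.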
How does co-location with computing task clusters reduce costs during the Double 11 Shopping Festival? A computing task cluster has three states: no online services; online services coexisting with computing tasks; and mostly online services with computing tasks temporarily downgraded (during the Double 11 Shopping Festival). After cluster co-location, resources are divided 3/7, and computing tasks can preempt the resources allocated to online services. During stress tests and non-peak promotion periods, resources are divided 5/5. At the promotion peak, computing tasks are downgraded temporarily to free resources. A complete transaction site can be set up within one hour, greatly reducing the cost of each transaction during the Double 11 Shopping Festival.
PouchContainer and Containerization Progress
Full containerization is a key technology that improves O&M capability and eliminates O&M differences. Let’s look at PouchContainer, developed by Alibaba. Launched in 2011 and built on LXC, it inherits the Docker image function and complies with many standards. Alibaba’s container has unique features: it integrates the Alibaba kernel to improve its isolation capability. Currently, millions of containers are deployed on Alibaba’s internal systems.
Next, let’s look at the history of Pouch. Alibaba once used virtual machine virtualization, and migrating from virtualization to containers posed many O&M challenges, since migrating the O&M system carries a huge technology cost. We built an application-centric internal O&M system: each container has an independent IP address, can be logged in to over SSH, and has an independent file system, with resources isolated and resource usage visible. Since 2015, Alibaba has adopted the Docker standard to form PouchContainer and integrate it with the entire O&M system.
PouchContainer is a rich container with good isolation. Users can log in to a container to view per-process resource usage and the number of processes; if one process stops, the container keeps running, and a container can run many processes. The container is backward and forward compatible, making full use of existing devices. Operating millions of deployed containers drove us to develop a P2P image distribution mechanism, greatly improving distribution efficiency. The container is also compatible with many industry standards, such as RunC, RunV, and RunLXC, and promotes the formulation of new standards.
PouchContainer has a clear structure, and Pouchd can interact with kubelet, Swarm, and Sigma. For storage, Alibaba participates in formulating the CSI standard and supports distributed storage such as Ceph and Pangu. PouchContainer also uses lxcfs to enhance in-container resource-view isolation, in compliance with many standards.
Currently, PouchContainer is used by most BUs of Alibaba. In 2017, millions of PouchContainers were deployed: all of Alibaba’s online services are containerized, and computing tasks are being containerized as well. PouchContainer evens out the O&M costs of heterogeneous platforms, supports multiple running modes and programming languages, and is compatible with the DevOps system. It has been applied to almost all of Alibaba’s business fields, such as Ant Financial, transactions, and middleware.
Alibaba announced on October 10, 2017 that PouchContainer would be open-sourced, and it formally went open source on November 19. The first formal version of PouchContainer was released in March 2018. We hope that open-sourcing PouchContainer will boost the development of the container field and the maturation of its standards, and offer the industry a differentiated, competitive technical choice. Traditional IT companies can enjoy the O&M benefits of containers without replacing their existing infrastructure, and start-ups can enjoy large-scale stability and standard compatibility.
PouchContainer open source URL: https://github.com/alibaba/pouch
When storage and computation are coupled, the state of stateful tasks must be copied around, which seriously hampers O&M automation and scheduling efficiency. We therefore separated storage from computation during cloudification. The computing cluster and online service clusters were not in the same IDC, and the data of computing tasks was stored on the online service cluster first, so we set up a cache bridgehead before performing computation.
After adjusting the IDC structure and optimizing the network, we began separating storage and computation for online computing tasks. The cache bridgehead has now been removed: computation is no longer restricted by upload bandwidth, and traffic within large clusters no longer needs to pass through the core switches, greatly improving scheduling flexibility. The storage-computation separation technology not only uses Alibaba’s Pangu technology but is also compatible with the industry’s container storage standards, and it is a key technology of Alibaba’s cloudification architecture. In the network architecture upgrade, 25G networking is widely used; VPC is used on the public cloud, and overlay networks connect data on the cloud, off the cloud, and within clusters. This is a prerequisite for large-scale co-location.
Cloudification Architecture and Technology Path in Future Double 11 Campaigns
This is Alibaba’s hybrid cloud elastic architecture: a technical system based on orchestration and a dynamically oriented architecture. It scales units up and down within minutes, quickly sets up transaction units on the cloud or on big data clusters, and completes routine inspection within seconds to ensure reliable delivery. The architecture effectively reduces servers’ resource holding time and non-online time, shortens loss time, and improves elasticity efficiency. During the Double 11 Shopping Festival, more than 60% of peak traffic runs on Alibaba Cloud, whose elastic infrastructure is fully used to set up the world’s largest hybrid cloud within eight hours.
Cloudification architecture O&M system in 11.11 campaigns
Resources are classified into online service clusters, computing task clusters, and ECS clusters. The infrastructure O&M systems, including resource management, stand-alone O&M, state management, command channels, and alarms, have been interconnected. For the 11.11 campaigns, we carve out an independent area on the cloud and connect it to other scenarios. In the connected area, Sigma scheduling can apply for resources from the computing cluster to produce PouchContainers, or apply for ECS instances through the cloud Open API to produce container resources. In routine scenarios, Fuxi can apply for resources from Sigma to create the containers it needs.
For the 11.11 campaigns, a large number of online services are set up through application-level and large-scale O&M, including co-location at the service layer. Each cluster runs online services, stateful services, and big data analysis. Online services and stateful data services are also deployed on Alibaba’s exclusive cloud, implementing “data center as a computer”: multiple data centers are managed like one computer, and resources can be scheduled for services across different platforms. The hybrid cloud’s servers are thus set up at minimum cost.
The server scale is increased first, and then resource utilization is greatly improved through time-based reusing and co-location, implementing elastic resource reuse and flexible task deployment. The fewest servers deliver the required service capacity in the shortest time and at the highest efficiency. With this cloudification architecture, Alibaba cuts the cost of new IT facilities for the Double 11 Shopping Festival by 50% and daily IT costs by 30%, maximizing the technical value of cluster management and scheduling. It also shows that the popularization of containers and orchestration-based scheduling is an inevitable trend.
Later, we will open these technologies through the Alibaba Cloud platform. They have proven advantages in internal scheduling and container O&M, integrating scheduling, orchestration, application management, monitoring, fast hybrid cloud building, elastic scale-up/down, and co-location capabilities. They are also compatible with the Kubernetes API, providing enterprise-level container application management and improving enterprises’ IT efficiency, competitiveness, and innovation speed. Verified by the 11.11 campaigns, co-location and automated hybrid cloud building are mature, stable technologies. On the cloud, we work with ACS, EDAS, and EMR to improve service integrity.
Future of Cloudification Architecture
Technology has reduced the cost of the 11.11 campaigns, and we have found the right direction. Through long-term building and development, we will achieve more in efficiency improvement. In the future, we hope to use the cloudification architecture to increase the utilization of Alibaba’s IDC resources, enlarge the scheduling scale, and enrich co-location models to increase returns. We will continue to promote the final-state-oriented system architecture and O&M system, reduce resource holding time by 30%, and further reduce the per-transaction cost of promotion events.
With the cost of the 11.11 campaigns optimized, we will focus next on efficiency improvement as well as saving time and labor. We will sample and analyze the technical variables of the 11.11 campaigns, make predictions at a microscopic level, and use data and algorithms to drive intelligent decisions. Based on data, intelligence, and human-machine interaction, we aim to run the next 11.11 campaign with higher efficiency and less labor investment. By accelerating the iteration of foundational technologies, we will find a new balance among experience, efficiency, cost, and maximum throughput, bringing a perfect Double 11 Shopping Festival to the industry and consumers.
To learn more about the Double 11 Global Shopping Festival, visit www.alibabacloud.com/campaign/singles-day-11-11-2018