The Network Architecture and Network Management System behind This Year’s Double 11
The gross merchandise volume (GMV) of this year’s Double 11 Shopping Festival reached 268.4 billion CNY, and all of Alibaba’s core systems ran on Alibaba Cloud. Alibaba Cloud therefore had to withstand and support one of the largest traffic peaks recorded in the history of the Internet.
The Technology Stack for the Big Event
The smooth online shopping experience during this year’s Double 11 was backed by several complex and innovative technologies. The following figure shows a simplified view of the general-purpose e-commerce system architecture that Alibaba used to run its e-commerce platforms on the day.
All the core business component systems were up in the cloud this year, running on Alibaba Cloud’s host of products, services, and solutions, including those specifically designed for computing, storage, networks, and databases.
Because of the huge processing capacity needed to run Alibaba’s e-commerce platforms, these business products, components, and modules are deployed in a distributed manner. This generates a massive volume of product-to-product, component-to-component, and even module-to-module communication requests. Alibaba Cloud’s Cloud Network Management system, nicknamed Luoshen, supports all of this communication.
What Is the Cloud Network Management System?
The Apsara Distributed Operating System is Alibaba Cloud’s main technology platform. It aims to turn a data center, or even multiple data centers around the world, into one unified supercomputer. Internally, it manages the servers and other physical resources and facilities in those data centers; externally, it provides public services and programming interfaces to its users.
The kernel of the Apsara Distributed Operating System provides the most basic system services and virtualizes basic resources, in particular computing, storage, and network resources. Within the kernel, Cloud Network Management provides the virtual network services, such as the Virtual Private Cloud (VPC), Software-Defined Network (SDN) controllers, and Server Load Balancer (SLB) network elements (NEs). In short, Cloud Network Management is the core kernel component that provides the network functions of the cloud computing platform.
The Features of Cloud Network Management
Having emerged alongside the Apsara Distributed Operating System, Cloud Network Management has been around for 10 years. Its purpose is to enable smooth communication among millions of virtual machines in twenty Alibaba Cloud regions, building on several technologies that have been accumulated and improved over the years.
Complete In-house Development by Alibaba Cloud
Currently, Alibaba Cloud provides the richest set of networking products in the industry, including on-cloud (off-premises) networks, cross-region networks, hybrid cloud networks, and intelligent networks.
These products are built on the Cloud Network Management system, whose core business code was developed wholly in-house at Alibaba Cloud and now runs to millions of lines. The technical solutions and business logic of all the underlying software systems and hardware devices were likewise created entirely by Alibaba Cloud. As a result, Alibaba Cloud’s Alibaba Virtual Switch (AVS) differs from Open vSwitch (OVS) in various respects, including its proprietary forwarding entry design and its packet processing.
Based on the software-defined network (SDN) architecture, Cloud Network Management implements control and forwarding separation. Specifically, network elements (NEs) only forward data, and SDN controllers generate and deliver the control configurations and table entries.
Forwarding NEs are programmable in both their software and hardware forms, and all related business logic is implemented in software code. Custom channel communication protocols are supported among SDN controllers. Software and hardware are integrated, and both are fully scalable.
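The separation of control and forwarding described above can be illustrated with a minimal Python sketch. It is not Alibaba Cloud’s implementation: the class names `Controller` and `ForwardingElement`, the table layout (destination VM IP to physical host), and the sample addresses are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ForwardingElement:
    """Hypothetical NE: holds a table and forwards, with no control logic."""
    name: str
    table: dict[str, str] = field(default_factory=dict)  # dst VM IP -> host

    def install(self, entries: dict[str, str]) -> None:
        # Entries are pushed by the controller; the NE never computes them.
        self.table.update(entries)

    def forward(self, dst_ip: str) -> str | None:
        # Pure lookup: return the next hop, or None (drop) on a table miss.
        return self.table.get(dst_ip)


class Controller:
    """Hypothetical SDN controller: owns VM placement and derives entries."""

    def __init__(self) -> None:
        self.placement: dict[str, str] = {}  # VM IP -> physical host
        self.elements: list[ForwardingElement] = []

    def register(self, ne: ForwardingElement) -> None:
        # A newly registered NE receives a snapshot of the current table.
        self.elements.append(ne)
        ne.install(dict(self.placement))

    def place_vm(self, vm_ip: str, host: str) -> None:
        # A control-plane change is turned into entries for every NE.
        self.placement[vm_ip] = host
        for ne in self.elements:
            ne.install({vm_ip: host})


controller = Controller()
ne_a, ne_b = ForwardingElement("ne-a"), ForwardingElement("ne-b")
controller.register(ne_a)
controller.register(ne_b)
controller.place_vm("10.0.0.5", "host-17")

print(ne_b.forward("10.0.0.5"))   # host-17
print(ne_b.forward("10.0.0.99"))  # None
```

The point of the sketch is that `ForwardingElement` contains no decision logic, so it could in principle be swapped for a hardware implementation without changing the controller.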
Alibaba Cloud runs a massive number of virtual machines to serve millions of public cloud tenants as well as ultra-large-scale enterprises such as Alibaba Group. To support network communication for such large tenants and their virtual machines, Cloud Network Management must deliver network element management, table entry delivery, and data forwarding performance far beyond what small-scale networks require. In production, Cloud Network Management has supported more than 100,000 virtual machine instances, more than 100 Gbit/s of public network bandwidth, and more than 20 Tbit/s of hybrid cloud bandwidth for a single tenant.
How Does Cloud Network Management Support Double 11?
“Not just any cloud provider is powerful enough to support the scale of traffic you see with Double 11.” This year, all the core systems of Alibaba Group ran on the public cloud for the first time. A virtualized distributed system supported a peak load of 544,000 orders per second and processed 970 PB of data in a day. Communication between the distributed nodes relies on the underlying cloud network infrastructure, which is Cloud Network Management.
So what specific challenges did the Cloud Network Management system face during the Double 11 Shopping Festival, and how did it tackle them?
All the core businesses of Alibaba Group now run entirely on Alibaba Cloud, which makes overall system management a huge challenge. Approximately 100,000 hosts ran in a single VPC instance during Double 11 in 2018; this year that number rose to roughly 300,000. Few tech companies could support such a scale. Other tech giants, even those in the public cloud service industry, do not deploy the entirety of their business on the public cloud, so a single VPC supporting this many virtualized instances is unprecedented. In addition, the overall on-cloud public network and cross-domain outbound traffic grew from about 5 Tbit/s last year to dozens of Tbit/s this year.
Logically, network devices consist of control devices and data forwarding devices. At the control layer, a traditional centralized SDN controller delivers forwarding entries slowly. As a result, virtual instances launch slowly, which hurts business provisioning and switchover efficiency. The control system of Cloud Network Management therefore adopts a hierarchical cluster architecture. It strengthens centralized control while bringing large numbers of virtual instances online quickly, which greatly improves configuration management and table entry processing performance.
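One way to picture why a hierarchical controller cluster speeds up entry delivery is to split the hosts among several sub-controllers that push entries in parallel. The sketch below assumes round-robin assignment of hosts to eight sub-controllers and uses a stand-in `deliver` function; the source does not describe the real partitioning scheme or delivery protocol, so all names and numbers here are hypothetical.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def partition(hosts: list[str], shards: int) -> list[list[str]]:
    # Assumption: hosts are assigned to sub-controllers round-robin.
    groups: list[list[str]] = [[] for _ in range(shards)]
    for index, host in enumerate(hosts):
        groups[index % shards].append(host)
    return groups


def deliver(shard_hosts: list[str], entries: dict[str, str]) -> int:
    # Stand-in for a sub-controller pushing entries to the vSwitch on each
    # of its hosts; returns the number of (host, entry) pushes performed.
    return len(shard_hosts) * len(entries)


hosts = [f"host-{i}" for i in range(3000)]   # illustrative fleet size
entries = {"10.0.0.5": "host-17"}            # one new forwarding entry
groups = partition(hosts, shards=8)

# Each sub-controller handles only its own shard, and shards run in parallel.
with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    pushed = list(pool.map(lambda group: deliver(group, entries), groups))

print(sum(pushed))  # 3000 pushes in total
print(max(pushed))  # 375 pushes on the busiest sub-controller
```

A single controller would perform all 3,000 pushes itself, whereas here the longest serial chain is 375, which is the intuition behind delivering entries through a cluster rather than one node.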
At the data forwarding layer, Cloud Network Management uses an architecture that integrates software and hardware. The VSwitch, originally built on the traditional DPDK architecture, has been upgraded to support fast forwarding on programmable hardware.
Compared with traditional software VSwitches, programmable hardware-based VSwitches improve forwarding performance by about 10 times and cut latency by more than half.
The rapid growth in public network and cross-domain bandwidth also poses a great challenge to the performance of virtual gateways built on the Data Plane Development Kit (DPDK). More bandwidth requires more devices, which increases management complexity and supply costs. Moreover, the capacity of a single CPU core is limited, so a software gateway cannot absorb burst traffic or a high-bandwidth single stream without affecting normal communication.
The architectural upgrade of the virtual gateway addresses this by supporting integrated software and hardware gateways. The business logic is implemented in the P4 programming language, and the external interfaces remain compatible with software virtual gateways. Compared with the traditional x86 software architecture, the programmable hardware gateway improves forwarding performance dozens of times over and effectively prevents a high-bandwidth single stream from overwhelming a single CPU core. With the software-hardware integrated architecture of Cloud Network Management, the traffic peak during the Double 11 Shopping Festival was well dispersed.
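The single-core limitation of software gateways can be sketched as follows. Software data planes commonly pin each flow to one core by hashing its 5-tuple, so one very large flow cannot be spread out even when the other cores are idle. The per-core capacity, the flow rates, and the CRC32-based hash below are assumptions for illustration, not measurements or details from Alibaba Cloud.

```python
from __future__ import annotations

import zlib

CORE_CAPACITY_GBPS = 10.0  # assumed per-core limit, for illustration only


def core_for(flow: tuple, cores: int) -> int:
    # Every packet of a flow hashes to the same core, which keeps packets
    # in order but means a single flow can never use more than one core.
    return zlib.crc32(repr(flow).encode()) % cores


def per_core_load(flows: dict[tuple, float], cores: int) -> list[float]:
    load = [0.0] * cores
    for flow, gbps in flows.items():
        load[core_for(flow, cores)] += gbps
    return load


# Sixteen small flows of 0.5 Gbit/s each: (src, dst, protocol, sport, dport).
flows = {("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443): 0.5 for i in range(16)}
# One high-bandwidth single stream of 12 Gbit/s.
elephant = ("10.0.0.9", "10.0.1.9", 6, 50000, 443)
flows[elephant] = 12.0

load = per_core_load(flows, cores=8)
overloaded = [c for c, gbps in enumerate(load) if gbps > CORE_CAPACITY_GBPS]

# Total demand (20 Gbit/s) is far below the 80 Gbit/s aggregate capacity,
# yet the core that owns the large flow is still over its limit.
print(sum(load), overloaded == [core_for(elephant, 8)])
```

In this model adding more cores does not help the large flow at all, which is consistent with the motivation given above for moving forwarding into programmable hardware.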
Stability is the most important aspect of the system because all the core businesses of Alibaba Group run on the public cloud. Individual nodes, the overall network architecture, and every solution involved must be extremely stable for the system to function properly.
Cloud Network Management ensures stable network communication through its architecture. Businesses are deployed by zone, and gateways for public network and cross-domain access are deployed as clusters in different zones so that no single point of failure (SPOF) can interrupt traffic. In addition, data is backed up between multiple zones.
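The zone-based gateway deployment can be sketched as a selection rule: prefer healthy gateways in the local zone and fall back to healthy gateways in other zones when the local ones are unavailable. The `Gateway` type, the zone names, and the prefer-local policy are assumptions; the source only states that gateway clusters are spread across zones.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Gateway:
    """Hypothetical gateway record used only for this illustration."""
    name: str
    zone: str
    healthy: bool = True


def pick_gateways(cluster: list[Gateway], local_zone: str) -> list[Gateway]:
    # Prefer healthy gateways in the caller's zone; if none remain,
    # fall back to healthy gateways in any other zone.
    healthy = [g for g in cluster if g.healthy]
    local = [g for g in healthy if g.zone == local_zone]
    return local or healthy


cluster = [
    Gateway("gw-a1", "zone-a"), Gateway("gw-a2", "zone-a"),
    Gateway("gw-b1", "zone-b"), Gateway("gw-b2", "zone-b"),
]

before = [g.name for g in pick_gateways(cluster, "zone-a")]

# Simulate the loss of every gateway in zone-a.
for gateway in cluster:
    if gateway.zone == "zone-a":
        gateway.healthy = False

after = [g.name for g in pick_gateways(cluster, "zone-a")]

print(before)  # ['gw-a1', 'gw-a2']
print(after)   # ['gw-b1', 'gw-b2']
```

Because each zone holds a cluster rather than a single gateway, the loss of one gateway, or in this sketch even a whole zone, still leaves a usable path.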
Complex Traffic Model
The business system of Alibaba is complex. It includes the e-commerce shopping system, the Ant payment system, the big data analysis system, and the Cainiao logistics system. Each of these systems has different traffic priorities and network communication requirements, including latency, bandwidth, and packet loss rate. Implementing all these complex businesses on a unified underlying cloud network can be challenging.
Consider Alibaba Group’s online and offline businesses, for example. One major offline business is big data. Big data jobs consume large amounts of traffic, and their spikes can fill the available bandwidth and cause packet loss. Online businesses generally need less bandwidth but are more sensitive to latency and packet loss. They therefore require the cloud network to support traffic classification so that, when the network is congested, low-priority traffic is discarded first to protect the main online business.
Cloud Network Management supports different quality of service (QoS) levels for different business scenarios. For businesses that need high bandwidth but tolerate packet loss, the priority of their packets is set to low, so that high-priority packets are not discarded during traffic spikes. In this way, complex traffic models are well supported.
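The priority-based discard behavior can be illustrated with a small strict-priority sketch: when the offered load exceeds the link capacity, higher-priority packets are admitted first and whatever no longer fits is dropped. The priority values, packet sizes, and packet names are invented for the example, and real QoS implementations typically use queues and schedulers rather than a one-shot sort.

```python
from __future__ import annotations


def schedule(
    packets: list[tuple[str, int, int]], capacity: int
) -> tuple[list[str], list[str]]:
    """Admit packets in priority order until the capacity is used up.

    Each packet is (name, priority, size); a larger priority value means
    more important. sorted() is stable, so arrival order is preserved
    within one priority class.
    """
    sent: list[str] = []
    dropped: list[str] = []
    used = 0
    for name, _priority, size in sorted(packets, key=lambda p: -p[1]):
        if used + size <= capacity:
            sent.append(name)
            used += size
        else:
            dropped.append(name)
    return sent, dropped


# Offline big data traffic is marked low priority (0); online order and
# payment traffic is marked high priority (7).
packets = [
    ("bigdata-1", 0, 60),
    ("order-1", 7, 10),
    ("bigdata-2", 0, 60),
    ("pay-1", 7, 10),
]

sent, dropped = schedule(packets, capacity=100)
print(sent)     # ['order-1', 'pay-1', 'bigdata-1']
print(dropped)  # ['bigdata-2']
```

Even though the big data packets arrived first and are much larger, the congestion only costs low-priority traffic, which matches the behavior described for online versus offline businesses.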
There is no perfectly reliable network in the physical world, so the Alibaba Cloud network pursues both the highest possible reliability and fast fault recovery. Before the Double 11 Shopping Festival, the Cloud Network Management system was refined continuously through pressure tests and fault drills to improve overall network O&M capabilities, so that faults could be detected, located, and rectified quickly on the big day.
Apsara Network Intelligence, the O&M platform of Cloud Network Management, is a distributed, intelligent big data O&M system. It integrates the massive amounts of data on Alibaba Cloud and applies big data and AI analysis to locate system faults and take emergency measures much faster.
Based on data streams, logs, and device statuses from the underlying network and the virtual network, the Blink-based big data analysis platform can quickly determine the network status and identify the root cause of a fault, so that automatic emergency measures are taken before the user even knows that a fault occurred. In addition, all typical, common faults are included in daily fault drills, which helps ensure even more efficient network O&M. The smart network is another powerful tool that Cloud Network Management provides to support the Double 11 Shopping Festival.
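As a rough illustration of stream-based fault detection, the sketch below keeps a sliding window of traffic samples per device and raises an alert when the packet loss rate over the window exceeds a threshold. It is a plain Python stand-in, not the Blink-based platform itself; the window length, the 1% threshold, and the sample format are assumptions.

```python
from __future__ import annotations

from collections import defaultdict, deque

WINDOW = 3             # assumed number of samples per sliding window
LOSS_THRESHOLD = 0.01  # assumed alert threshold: 1% packet loss


def detect(samples: list[tuple[int, str, int, int]]) -> list[tuple[int, str]]:
    """Return (time, device) alerts from (time, device, sent, lost) samples."""
    windows: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))
    alerts: list[tuple[int, str]] = []
    for time, device, sent, lost in samples:
        window = windows[device]
        window.append((sent, lost))
        total_sent = sum(s for s, _ in window)
        total_lost = sum(l for _, l in window)
        # Only judge a device once its window is full, to avoid alerting
        # on a single noisy sample.
        if len(window) == WINDOW and total_sent > 0:
            if total_lost / total_sent > LOSS_THRESHOLD:
                alerts.append((time, device))
    return alerts


samples = [
    (1, "gw-a1", 1000, 0),  (1, "gw-b1", 1000, 1),
    (2, "gw-a1", 1000, 40), (2, "gw-b1", 1000, 0),
    (3, "gw-a1", 1000, 50), (3, "gw-b1", 1000, 2),
]

print(detect(samples))  # [(3, 'gw-a1')]
```

An alert like this would be the trigger for an automatic emergency measure; how the real platform correlates logs and device statuses to find a root cause is not described in enough detail here to sketch.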
From the above discussion, one thing is clear: the Cloud Network Management system is being continuously improved. With the evolution from DPDK NEs in version 1.0 to software-hardware integrated NEs in version 2.0, its network capability has improved significantly, which in turn has allowed all the core businesses of Alibaba Group to run on the cloud. In the future, Cloud Network Management will strive to be even more elastic and open in order to provide users with an even better overall experience.