At the GNTC Cloud Summit on November 15, Mr. Zong Zhigang, a senior network technology expert at Alibaba Cloud, delivered the keynote speech “Apsara Luoshen: the high-performance network engine that powers Alibaba Cloud.” Apsara Luoshen is the core of Alibaba Cloud’s virtual network system. It provides a wide range of Alibaba Cloud network products and serves as the network infrastructure of all Alibaba Cloud products. Apsara Luoshen also powers the business of Alibaba Group and Ant Financial. The sub-forum explored in depth the key technical capabilities and future application directions of the Apsara Luoshen system.
What Is Apsara Luoshen?
Part of Mr. Zong Zhigang’s thrilling speech is presented as follows:
Alibaba Cloud subsystems are named after Chinese mythological deities, such as Luoshen, Fuxi, Pangu, and Nuwa.
In Chinese mythology, Luoshen is the deity of rivers. Rivers were a very important means of transportation for ancient people. Today, networks are as important to how we live and communicate at work as rivers were to transportation in ancient times. Therefore, we named Alibaba Cloud’s network Apsara Luoshen when it was founded.
Apsara Luoshen now manages an enormous infrastructure network system for Alibaba Cloud, which covers 19 regions, has a total outbound bandwidth of 20 Tbit/s, and runs more than 200 BGP lines. These numbers are still increasing. Alibaba Cloud provides the most extensive network products in the industry, which are divided into four product lines based on where they are deployed:
- Networks on the cloud (Apsara Luoshen), such as VPC, NAT Gateway, Server Load Balancer (SLB), and the IPv6 network, which Alibaba Cloud plans to develop vigorously
- Networks between clouds (Apsara ZhiNu), such as Cloud Enterprise Network (CEN) and Global Acceleration (GA)
- Networks connecting on-premises data centers and the cloud (Apsara ChangE), such as Express Connect, VPN Gateway, and Smart Access Gateway (SAG)
- Smart robot (Apsara QiTian): built on Alibaba Cloud big data analysis, it helps network planners and operators solve their problems through human-machine interaction. It was developed to improve the efficiency of network operators and planners. It is being tested and is expected to go online next year.
Alibaba Cloud Network Technology Architecture
Alibaba Cloud’s network architecture is an assembly of the above network products, and Apsara Luoshen represents Alibaba Cloud networks on the cloud.
To better understand the logic of this architecture, let’s look at the general steps of enterprise cloudization and the accompanying changes in enterprise IT systems, and then at the network requirements they create. Enterprise cloudization is divided into four stages:
- In the first stage, the enterprise migrates some IT systems to the cloud, to improve user experience and the O&M efficiency.
- In the second stage, the enterprise migrates all infrastructure resources to the cloud, to optimize the IT resource usage.
- In the third stage, the enterprise carries out micro-service transformation of the architecture, builds a middleware-centric business architecture, shares business modules with other enterprises, and quickly builds its own business system to improve IT agility.
- In the final stage, Alibaba Cloud believes that when an enterprise fully rolls out its business on the cloud, it can collect a massive amount of data. Analyzed with smart learning and big data analysis tools, this data helps guide the enterprise to provide more precise services.
Migrating Enterprise Systems to the Cloud
As can be seen from the evolution history of Alibaba Cloud’s system architecture, at the beginning, the business application and its data were deployed together on a single machine. Then, applications and data were deployed in a layered, multi-tier manner. After that came distributed clusters and then extensive microservice transformation. Throughout this process, the flexibility between applications, and between applications and data, has been constantly increasing.
After the completion of enterprise cloudization, networks are “visible” to services on the cloud. Therefore, networks must first provide various gateway-type services, and then provide the elastic scalability, security, reliability, and highly effective O&M capability as needed by services provided by the enterprise.
Moreover, Alibaba Cloud believes that in the future, the infrastructure will no longer be visible to enterprises on the cloud. Enterprises will not need to see underlying services such as computing, storage, and networks. All they will need to do is select frontend and middleware services from the cloud-based ecosystem, and then slightly adjust these services to build their own systems. The ultimate goal of Apsara Luoshen is to make networks invisible to end users.
The timing of Alibaba Cloud’s network product launches has largely followed the pace of enterprise cloudization. At the beginning, we only provided single-instance products such as AVS and SLB. After cloudization of basic instances, we launched Express Connect, private lines, and SAG to connect cloud-based and on-premises systems. As resource granularity became finer and the distribution scope expanded, Alibaba Cloud launched Global Acceleration and CEN. As enterprise architecture on the cloud evolves, Alibaba Cloud Network will keep launching diversified network products for different business systems to meet customer needs.
Key Features of Apsara Luoshen
Next, let’s introduce the key technologies of the Apsara Luoshen system, which has three key features: it is flexible, reliable, and smart.
The flexibility of Apsara Luoshen is reflected by two significant numbers. The first is that Apsara Luoshen supports elastic scaling of the forwarding bandwidth from 1 Mbit/s to 1 Tbit/s within one second. The second is capacity scalability: up to 100,000 ECS instances are supported within a single network.
The outstanding elastic scaling capability of Alibaba Cloud is mainly attributable to the following two factors:
- Data plane: At present, the Apsara Luoshen system supports x86-based, FPGA-based, and ASIC-based forwarding methods. How can we apply different forwarding technologies and forwarding products to different scenarios? Currently, most gateway and network source products mainly use x86-based software forwarding, but they will gradually be upgraded to smart NICs. Some scenarios have high and unstable bandwidth requirements. For example, some of Alibaba Cloud’s VRP customers require very high bandwidth for storage access, but the bandwidth requirement varies. In this case, Alibaba Cloud deploys ASIC chips to enhance the customer capabilities. With these various forwarding technologies, Apsara Luoshen now has a huge forwarding resource pool that can quickly meet demands for higher forwarding performance.
- Control plane: Apsara Luoshen manages an enormous network system, and a conventional centralized single control plane is unable to meet the requirements. Therefore, the Apsara Luoshen system adopts a distributed hierarchical control plane and, very importantly, a caching mechanism in the forwarding plane (also known as the data plane). Similar to the way a virtual machine generates table entries, Apsara Luoshen automatically learns entries through caching in the data plane. Compared with having the control plane deliver entries one by one, automatic entry learning works much better than the centralized method in terms of both delivery efficiency and delivery speed. This allows Apsara Luoshen to quickly enable and disable compute nodes.
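The caching idea in the control plane description can be illustrated with a small sketch. It is not Apsara Luoshen’s actual implementation: the `Controller` and `ForwardingNode` classes, their methods, and the addresses are all hypothetical, and the sketch only assumes that a data-plane node asks the control plane on a cache miss and then serves later packets from its local cache.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Controller:
    """Hypothetical control plane that holds the authoritative entries."""
    routes: Dict[str, str]
    lookups: int = 0  # counts how often the data plane had to ask

    def resolve(self, dst_ip: str) -> Optional[str]:
        self.lookups += 1
        return self.routes.get(dst_ip)


@dataclass
class ForwardingNode:
    """Hypothetical data-plane node that learns entries on demand."""
    controller: Controller
    cache: Dict[str, str] = field(default_factory=dict)

    def forward(self, dst_ip: str) -> Optional[str]:
        next_hop = self.cache.get(dst_ip)
        if next_hop is None:
            # Cache miss: ask the control plane once, then remember the answer.
            next_hop = self.controller.resolve(dst_ip)
            if next_hop is not None:
                self.cache[dst_ip] = next_hop
        return next_hop

    def invalidate(self, dst_ip: str) -> None:
        # Dropping an entry forces it to be re-learned on the next packet.
        self.cache.pop(dst_ip, None)


controller = Controller(routes={"10.0.0.2": "host-a"})
node = ForwardingNode(controller=controller)
node.forward("10.0.0.2")   # learned from the controller
node.forward("10.0.0.2")   # served from the local cache
```

In this sketch the controller is consulted only on the first packet to a destination, so the control plane does not have to push every entry to every node in advance.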
Last year, the average fault duration across all Alibaba Cloud network instances (excluding vSwitch networks) was 50 milliseconds, which is a very short period of time.
When it comes to reliability, the most fundamental requirement is to implement cross-data-center multi-active deployment.
Each key node of Apsara Luoshen consists of controllers and various gateways. Each data center is deployed with one or multiple clusters. Different nodes and data centers back up each other. When one node is down, Luoshen enables failover within the same cluster. When the number of faulty nodes within a single cluster exceeds the threshold, Luoshen enables failover between different clusters within the same data center or across different data centers. This avoids network unavailability due to the fault of a single node or a single cluster. This is actually an implementation of cross-data-center multi-active deployment.
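The failover escalation described above can be sketched as follows. This is only an illustration under stated assumptions: the `Node` and `Cluster` types, the `max_faulty` threshold parameter, and the preference for clusters in the same data center over clusters in other data centers are hypothetical choices, not details confirmed by the speech.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    name: str
    data_center: str
    nodes: List[Node]

    def faulty_count(self) -> int:
        return sum(1 for n in self.nodes if not n.healthy)

    def healthy_nodes(self) -> List[Node]:
        return [n for n in self.nodes if n.healthy]


def pick_failover_target(home: Cluster, clusters: List[Cluster],
                         max_faulty: int) -> Optional[Node]:
    """Pick a healthy node, escalating from the home cluster outward."""

    def usable(c: Cluster) -> bool:
        return c.faulty_count() <= max_faulty and bool(c.healthy_nodes())

    # 1. Fail over inside the home cluster while it is under the threshold.
    if usable(home):
        return home.healthy_nodes()[0]

    # 2. Otherwise try other clusters. Assumption: same data center first.
    others = [c for c in clusters if c is not home]
    others.sort(key=lambda c: c.data_center != home.data_center)
    for c in others:
        if usable(c):
            return c.healthy_nodes()[0]
    return None  # nothing usable anywhere
```

With a threshold of one faulty node, a second failure in the home cluster moves traffic to another cluster rather than to the remaining local node.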
Next, let me introduce Luoshen’s quick fault detection and flow coloring system.
This system colors and matches specific flows. It works on both Alibaba Cloud virtual networks and physical networks. While a device is running, the system samples mirrored traffic and adds timestamps to specific services and to flows with specific colors for real-time data analysis. Through this analysis, the system quickly detects faults in specific flows. For example, upon detecting packet loss, it instantly notifies the network administrator so that the fault can be fixed. This is basically the same as conventional IT logic, but the Apsara Luoshen system works mainly from the perspective of the customer’s business.
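To make the idea concrete, here is a minimal sketch of how colored, timestamped samples could be compared to find packet loss and delay. It assumes two observation points (called ingress and egress here) and a per-flow sequence number; the `Sample` type and both functions are hypothetical and are not part of any Alibaba Cloud API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Sample:
    color: str        # marker assigned to the flow being traced
    seq: int          # per-flow packet sequence number (assumed)
    timestamp: float  # seconds, recorded at the observation point


def lost_packets(ingress: List[Sample],
                 egress: List[Sample]) -> Dict[str, Set[int]]:
    """Return, per color, the sequence numbers seen at ingress but not egress."""
    seen = {(s.color, s.seq) for s in egress}
    lost: Dict[str, Set[int]] = {}
    for s in ingress:
        if (s.color, s.seq) not in seen:
            lost.setdefault(s.color, set()).add(s.seq)
    return lost


def max_latency(ingress: List[Sample],
                egress: List[Sample]) -> Dict[str, float]:
    """Return, per color, the worst ingress-to-egress delay in seconds."""
    sent = {(s.color, s.seq): s.timestamp for s in ingress}
    worst: Dict[str, float] = {}
    for s in egress:
        key = (s.color, s.seq)
        if key in sent:
            delay = s.timestamp - sent[key]
            worst[s.color] = max(worst.get(s.color, 0.0), delay)
    return worst
```

A monitoring loop could raise an alarm for any color that appears in the result of `lost_packets`, which corresponds to the packet loss notification mentioned above.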
In fact, smartness and reliability are closely related to each other, because smart monitoring, smart O&M, and smart fault recovery can effectively reduce the time of a fault.
Speaking of smartness, we have to introduce Alibaba Cloud’s big data-based smart O&M platform, which is called Apsara QiTian inside Alibaba Cloud. Apsara QiTian collects various kinds of user information (such as flow data and statistical data) through the data plane and control plane. JStorm, a data collection tool, is then used to classify the data into basic data, monitoring data, and diagnostic data. Afterwards, the platform matches and computes on this data according to certain rules before outputting the analysis results to the alarm interface, APIs, and the robot. It also isolates and recovers from certain faults.
Here are some scenarios of Apsara QiTian:
- Changes: As you know, many online problems are caused by network changes. How can we reduce the impact of changes on the business? Alibaba Cloud generally makes changes at the time points with the minimum business traffic and the minimum customer impact. One way to select the time point is human intervention, which involves a considerable amount of effort. In addition, manual selection is inaccurate and may cause unforeseeable impact on the business. The Apsara QiTian platform can accurately create user network source profiles based on big data analysis, and select the time point with the minimum business traffic to make automatic changes. This greatly reduces the probability of affecting the customer’s business operations. In addition, sometimes we may need to restore some data centers and network sources when a new version is released. In most cases, it is difficult to determine which data centers and network sources should be restored first. Based on big data analysis and Apsara QiTian, we can determine which ones will have the least impact on the business. This is also done by the smart platform, which allows us to minimize the business impact during automatic network changes.
- Exception detection: Apsara QiTian supports online detection of various policy exceptions. During actual operation, Apsara QiTian determines whether any policy exceptions exist. If so, alarms are sent and the quick fault escape process begins. This process is very challenging, because it must make the best judgment on the overall situation. A comprehensive analysis of the business scenario is required to determine whether to block the circuit or the node, or to start a failover of the entire data center. In fact, when running in the Alibaba Cloud network environment, Apsara QiTian can comprehensively analyze the scenario and decide on the best fault escape policy with minimal impact on the customer’s business.
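The change scenario above comes down to picking the time window with the least business traffic. A minimal sketch of that selection, assuming the traffic profile is a list of hourly totals that repeats daily, could look like this. The function name, the hourly granularity, and the wrap-around at midnight are illustrative assumptions rather than a description of how Apsara QiTian works.

```python
from typing import Sequence


def quietest_window(hourly_traffic: Sequence[float], window_hours: int) -> int:
    """Return the start hour of the contiguous window with the least traffic.

    The profile is treated as cyclic, so a window may wrap past the last hour.
    """
    n = len(hourly_traffic)
    if not 0 < window_hours <= n:
        raise ValueError("window_hours must be between 1 and the profile length")

    # Sliding-window sum: start with hours [0, window_hours).
    current = sum(hourly_traffic[:window_hours])
    best_start, best_total = 0, current
    for start in range(1, n):
        # Slide by one hour: add the hour entering, drop the hour leaving.
        entering = hourly_traffic[(start + window_hours - 1) % n]
        leaving = hourly_traffic[start - 1]
        current += entering - leaving
        if current < best_total:
            best_start, best_total = start, current
    return best_start


# Hypothetical 24-hour traffic profile for one customer, in arbitrary units.
profile = [50, 40, 30, 10, 10, 20, 60, 90, 120, 150, 160, 170,
           180, 170, 160, 150, 140, 150, 170, 200, 220, 180, 120, 80]
change_start_hour = quietest_window(profile, window_hours=2)
```

A real platform would build the profile from collected flow data and would weigh more than raw traffic volume, but the selection step can be this simple once a profile exists.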
Applications of Apsara Luoshen
Now, let’s take a look at the application of Apsara Luoshen. As of now, the network of Alibaba Cloud is enormous. There are millions of network devices and tens of millions of network instances running on its virtual network sources. It also monitors more than one thousand network metrics, and supports big data analysis. Events like the Double Eleven Global Shopping Festival impose the greatest pressure on the network. The data for this year’s Double Eleven Global Shopping Festival is not fully available yet. According to the currently available data, the peak transaction volume exceeded 325,000, and the bandwidth of a single SLB instance exceeded 160 Gbit/s. Now, all Alibaba Cloud services are running on the Apsara Luoshen system, which has already become an important service in the VPC network.
Finally, let’s summarize the mission of Apsara Luoshen. So far, Apsara Luoshen has gone through three generations. The first generation was the classic network, which was mainly used to solve the problem of connectivity. The second generation launched the VPC network for security isolation. We are currently using the third-generation Apsara Luoshen, which provides solutions for connecting to the cloud and offers the same capabilities as conventional enterprise networks. Alibaba Cloud defines the final stage as Networkless. In this stage, networks are invisible to end users. As its name indicates, Apsara Luoshen is a deity that you cannot see, but it exists everywhere.