Best Practices of Global CDN QoS Optimization

Alibaba Cloud Content Delivery Network (CDN) serves more than 300,000 customers through its 1,500+ nodes deployed across over 70 countries and regions on six continents with a bandwidth capacity greater than 120 Tbit/s, making Alibaba Cloud the only CDN provider in China that has been rated as “Global” by Gartner. As the business and node bandwidth grow, quality of service (QoS) optimization has become a topic that is worth discussing. In the Apsara User Group — CDN and Edge Computing Session at the Computing Conference 2018 in Hangzhou, Liu Tingwei, Senior Technical Expert at Alibaba Cloud, shared with the audience the technical practices of CDN QoS optimization.

Image for post

Global Development Process of Alibaba Cloud CDN

In the next few years, Alibaba Cloud CDN released the high-speed CDN 6.0, P2P CDN (PCDN), Secure CDN (SCDN), Dynamic Route for CDN (DCDN), and other related products in succession. In March 2018, Alibaba Cloud was rated as “Global” by Gartner in its latest Market Guide for CDN Services. That summer, Alibaba Cloud CDN carried 70% of the live traffic for the 2018 FIFA World Cup.

However, as the business and node bandwidth grew, QoS problems became more and more prominent. Liu said, “When discussing QoS optimization, we need to consider the entire ecosystem, combine technical and industry backgrounds, and search for every direction in which optimization can be made. This figure best describes how we felt at the beginning: We could see our goals ahead, but we had no way out. Then, our team calmed down and resolved to solve the QoS problems completely.”

Image for post

The following figure is a simplified logical architecture diagram of the CDN, which shows that the CDN is a complete ecosystem. Following a user’s access path, we can see which subsystems are involved in each phase. First, the user’s access request is sent to the scheduling subsystem, in which the carrier’s local Domain Name System (LDNS) and the CDN’s scheduling system jointly assign the nearest edge node to the user. Then, when the access request arrives at the edge node, the user obtains the corresponding live, on-demand, download, trend, or real-time audio and video content from the edge node. This part is implemented by the cache system. Next, the link quality system takes charge of data transmission between cache nodes, between cache nodes and the user, and between cache nodes and the origin. Finally, there is a business support system for configuration, data, and monitoring.

Image for post

In his talk, Liu focused on scheduling, link quality, and Five-hundred-meter Aperture Spherical Telescope (FAST)-based monitoring. In just half an hour, he walked through the QoS optimization practices across the entire CDN ecosystem, and the audience said that they benefited a lot from his speech.

Scheduling Subsystem Optimization

Node Coverage

Image for post

Scheduling: LDNS

“Another issue is LDNS profiling. We know that different LDNSs vary in how they handle polling, forwarding, and the time to live (TTL). The frontend and backend IP addresses of an LDNS are also in different proportions. We cannot ask every LDNS to resolve IP addresses in accordance with our requirements. However, we can adjust our own resolution policy based on the LDNS characteristics. Some people think that we do not need to care about LDNS problems because we can use 302 correction. Can we just rely on 302 correction?” Liu said, “The answer is no. The round trip time (RTT) of each 302 scheduling is about 20 ms, which should not be underestimated. When we optimize the QoS of small files such as images, we can improve the QoS only by 1–2 ms. For example, the response time for processing an image of 1 KB is around 10–15 ms. If another 20 ms is added, the performance deteriorates by half. As a result, the user conversion rate of our customers may drop by 10%, which is unbearable.”
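
The arithmetic behind this example can be checked with a short sketch. The figures are the ones quoted above (a 10–15 ms response for a 1 KB image and roughly 20 ms per 302 hop); the function name is illustrative and not part of any Alibaba Cloud API.

```python
from typing import Tuple


def latency_with_redirect(base_ms: float, redirect_rtt_ms: float = 20.0) -> Tuple[float, float]:
    """Return (total latency in ms, slowdown factor) when one 302 hop is added."""
    total = base_ms + redirect_rtt_ms
    return total, total / base_ms


# Base response times quoted in the talk for a 1 KB image.
for base in (10.0, 15.0):
    total, factor = latency_with_redirect(base)
    print(f"{base:.0f} ms -> {total:.0f} ms ({factor:.2f}x)")
```

With these inputs the total response time grows to between roughly 2.3 and 3 times the original, which is far larger than the 1–2 ms that small-file tuning can win back.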

Flow Control

Image for post

How do we achieve offline and real-time flow control? Think of each node as a bottle and the scheduled traffic as what fills it: DNS traffic is like big stones, whereas HttpDNS traffic and 302 traffic are more like sand. The scheduling granularity can therefore be matched to each case. Offline planning focuses on large chunks of DNS traffic, and real-time flow control deals with HttpDNS traffic and 302 traffic. Combining the two keeps the water level balanced across nodes and dispatches each user to the optimal node. Alibaba Cloud’s flow control system not only supports the normal CDN business well, but also withstands burst traffic in events such as the 11.11 Global Shopping Festival, CCTV Spring Festival Gala, and FIFA World Cup.
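
The stones-and-sand idea can be illustrated with a minimal sketch. It is not Alibaba Cloud's algorithm: the greedy placement rule, the node capacities, and all names (`Node`, `plan_offline`, `dispatch_realtime`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: float      # assumed capacity, e.g. in Gbit/s
    load: float = 0.0

    @property
    def level(self) -> float:
        """Water level: fraction of capacity in use."""
        return self.load / self.capacity


def least_loaded(nodes, extra):
    """Pick the node whose water level stays lowest after adding `extra`."""
    return min(nodes, key=lambda n: (n.load + extra) / n.capacity)


def plan_offline(nodes, dns_chunks):
    """Place indivisible DNS chunks (the "stones"), largest first."""
    for chunk in sorted(dns_chunks, reverse=True):
        least_loaded(nodes, chunk).load += chunk


def dispatch_realtime(nodes, total, step=1.0):
    """Pour divisible HttpDNS/302 traffic (the "sand") in small steps."""
    remaining = total
    while remaining > 1e-9:
        part = min(step, remaining)
        least_loaded(nodes, part).load += part
        remaining -= part


nodes = [Node("a", 100), Node("b", 100), Node("c", 50)]
plan_offline(nodes, [60, 40, 30])      # coarse placement leaves levels uneven
dispatch_realtime(nodes, 70)           # fine-grained traffic evens them out
for n in nodes:
    print(n.name, round(n.level, 2))
```

The coarse chunks alone leave the nodes at different levels; the fine-grained traffic is what lets the scheduler even them out, which is the point of combining the two modes.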

Link Subsystem Optimization

Image for post

Next, Liu turned to link quality optimization. The problem to be solved by the link subsystem is how to transmit data efficiently over existing network links.

Optimization Scheme for a Fixed Unidirectional Link: Protocol Stack Optimization

Image for post

Routing Decision-Making

Image for post

When we open AutoNavi to plan a route from the Alibaba Center to the Beijing South Railway Station, we can find that:

First, AutoNavi marks the congestion level on each road of each route, which is equivalent to real-time detection of network quality.

Second, AutoNavi has planned three routes for us, which is equivalent to route selection. The CDN also selects three shortest paths between the origin and the user.

Third, AutoNavi gives options such as high-speed priority and congestion avoidance, as well as the restricted areas under the distinctive license plate policy in Beijing. This feature is equivalent to link affinity. For example, if the origin is located in China Mobile Tietong, the CDN is then connected to China Mobile Tietong.
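
The route selection step in this analogy can be sketched as a search for the k cheapest loop-free paths over links weighted by measured latency. This is only an illustration under stated assumptions: the topology, the latency values, and the function name are hypothetical, and the exhaustive best-first search shown here is suitable for small graphs only.

```python
import heapq


def k_shortest_paths(graph, src, dst, k=3):
    """Return up to k loop-free paths from src to dst, cheapest first.

    graph maps node -> {neighbor: cost}; costs must be non-negative.
    Partial paths are expanded in order of accumulated cost, so complete
    paths are found in non-decreasing cost order.
    """
    found = []
    heap = [(0.0, [src])]
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            found.append((cost, path))
            continue
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in path:          # keep paths loop-free
                heapq.heappush(heap, (cost + weight, path + [nxt]))
    return found


# Hypothetical probe results: link latency in ms.
links = {
    "origin": {"relay1": 30, "relay2": 20, "edge": 90},
    "relay1": {"edge": 25},
    "relay2": {"relay1": 5, "edge": 45},
}
for cost, path in k_shortest_paths(links, "origin", "edge"):
    print(cost, " -> ".join(path))
```

In such a model, link affinity could be expressed by adjusting the costs, for example by penalizing links that cross carriers, although the talk does not say how the CDN implements it.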

FAST System

Image for post

The FAST system aims to detect and diagnose QoS problems in a timely manner and to process them automatically based on historical experience. It provides two main features: monitoring, which detects problems promptly, and alarm processing, which handles the detected problems automatically based on historical experience.


Image for post

All of this network data is collected at the kernel level. We can then use it to identify basic QoS problems in the download business. If a customer encounters stalling during a live broadcast, how can the problem be identified through monitoring? In this scenario, full-link monitoring is required. Alibaba Cloud implements a full-link, global data monitoring system for the live broadcast business.

Alarm Processing

As the business develops, alarms are configured for various data sources. As a result, the system is flooded with more and more alarms, which cannot be effectively processed. Therefore, the primary problem to be solved is how to ensure the accuracy and convergence of alarms.

How do we conduct alarm convergence?

  1. Based on business characteristics and abnormal events that trigger alarm reporting, associate and merge alarms to reduce duplicate alarms.
  2. Process alarms in work orders, and follow up on and collect statistics from these work orders.
  3. Classify alarms, which is critical. Work orders for alarms at different levels require different notification methods, notification access modes, and response times. For example, for high-priority alarms, we must guarantee the response time to achieve timely and accurate alarm processing; for low-priority alarms, we need to pay more attention to follow-up tracking and the processing rate.
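
The three steps above can be sketched in a few lines. The alarm fields, the merge key, and the set of high-priority events are hypothetical choices for illustration, not the rules used by the FAST system.

```python
from dataclasses import dataclass, field


@dataclass
class WorkOrder:
    key: tuple            # (business, event) shared by the merged alarms
    level: str            # "high" or "low"
    alarms: list = field(default_factory=list)


# Hypothetical rule: events that are treated as high priority.
HIGH_PRIORITY_EVENTS = {"live_stalling", "node_down"}


def converge(alarms):
    """Merge alarms sharing (business, event) into one classified work order."""
    orders = {}
    for alarm in alarms:
        key = (alarm["business"], alarm["event"])
        if key not in orders:
            level = "high" if alarm["event"] in HIGH_PRIORITY_EVENTS else "low"
            orders[key] = WorkOrder(key, level)
        orders[key].alarms.append(alarm)
    # High-priority work orders come first so they get the fastest response.
    return sorted(orders.values(), key=lambda o: o.level != "high")


alarms = [
    {"business": "download", "event": "slow_speed", "node": "n1"},
    {"business": "live", "event": "live_stalling", "node": "n2"},
    {"business": "live", "event": "live_stalling", "node": "n3"},
]
for order in converge(alarms):
    print(order.level, order.key, len(order.alarms))
```

Keeping the merged alarms on the work order preserves the raw data for follow-up statistics while only one notification per work order is sent.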

After alarm convergence is conducted, we may still find that manual processing is not timely enough. In this case, the automated processing capability is required. Automated processing actually relies on the transformation of manual processing experience into system capabilities. That is, we can train the system with the common troubleshooting methods, data, and means to routinize the troubleshooting procedure.

For example, live broadcast is very sensitive to stalling, and it also raises high requirements for the response time. We can formulate a procedure based on the troubleshooting means and processing of stalling problems. Once a problem is detected, an alarm is directly generated to inform the system of the occurrence, cause, and repair strategy of this problem.

In addition, we need to be aware of the following issues:

First, the traps of automated processing. We cannot be so naive as to believe that everything can be automated, because simple automated processing often sets huge traps in the system. Take the isolation of faulty nodes as an example. If 1,000 out of 1,500 nodes are detected as faulty, what should we do? Can we take the faulty nodes offline for repair and simply rely on automated processing? If we do so, the remaining 500 nodes will be overwhelmed by traffic. We must set a fuse to keep automated processing controllable.

Second, the completeness of automated processing. Automated processing cannot be a simple repetition of manual experience. The system needs to be able to learn on its own. With machine learning, the system can use the abnormal data related to alarms to develop its own capability of processing unknown problems.
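
The fuse described in the first point can be sketched as a guard in front of automated isolation. The 20% threshold and the function name are assumptions for illustration; the talk does not state the actual limit or mechanism.

```python
def isolate_with_fuse(all_nodes, faulty_nodes, max_offline_ratio=0.2):
    """Return (nodes to isolate automatically, whether the fuse tripped).

    If isolating every flagged node would take more than max_offline_ratio
    of the fleet offline, isolate nothing and escalate to a human instead.
    """
    faulty = set(faulty_nodes) & set(all_nodes)
    if len(faulty) > max_offline_ratio * len(all_nodes):
        return set(), True      # fuse tripped: too many nodes flagged at once
    return faulty, False


fleet = [f"node-{i}" for i in range(1500)]

# 1,000 of 1,500 nodes flagged: automation stops rather than overload the rest.
isolated, tripped = isolate_with_fuse(fleet, fleet[:1000])
print(len(isolated), tripped)

# A handful of faulty nodes: safe to isolate automatically.
isolated, tripped = isolate_with_fuse(fleet, fleet[:10])
print(len(isolated), tripped)
```

A mass fault report is often a sign that the detector itself is wrong, so stopping and escalating is usually safer than acting on it.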

Customer Examples

The first example is AirAsia. Alibaba Cloud provides AirAsia with a global DCDN solution, which provides route optimization and tunneling services for dynamic data. The solution has helped AirAsia increase the response speed by 150%.

The second example is Tokopedia, the biggest online sales platform for customer-to-customer (C2C) business in Indonesia. Alibaba Cloud provides Tokopedia with a full-link Hypertext Transfer Protocol Secure (HTTPS) SCDN solution, which guarantees secure payments and achieves acceleration by more than 100%. Taking advantage of the Auto Scaling feature of Alibaba Cloud CDN, this solution has helped Tokopedia easily handle burst traffic of dozens of times the usual volume in multiple sales promotions.

The third example is Toutiao, the fastest growing mobile Internet product in China. Besides Toutiao, the customer also offers TikTok (known as Douyin in China), Vigo Video (known as Huoshan in China), and other popular products based on its leading technical capabilities in the country. Alibaba Cloud CDN works together with the Toutiao technical team to establish an end-to-end (E2E) short video quality monitoring system. After overall optimization, the sum of the video stalling rate, interruption rate, and failure rate has been lowered to less than 1%.

The last example is Huya, a leading interactive live broadcast platform in China. Alibaba Cloud CDN and Huya have jointly implemented E2E full-link monitoring for live broadcast. The solution can monitor and locate live broadcast problems and causes in real time, to ensure users’ smooth experience of live broadcast on Huya.

At the end of the speech, Liu said, “Although we have done a lot of work to optimize the QoS, there is still a long way to go. We welcome all experts to join us and work together to build the best CDN in the world!”

