Alibaba Engineers Team Up with World-Renowned Mathematician to Develop Next-Gen Algorithm
In 2008, when the British mathematician Frank Kelly, a professor at the University of Cambridge, won the John von Neumann Theory Prize for operational optimization with his original network theory, he may not have imagined that 11 years later he would be personally involved in “reworking” his original brainchild with Alibaba Cloud. Yet that is what has happened in recent months.
Kelly was invited to build a new theory and framework by a group of Alibaba engineers, led by Zhang Ming, who are very familiar with Kelly’s research and have been greatly influenced by it since they were pursuing their PhDs in the United States.
Alibaba representatives presenting at the SIGCOMM 2019 conference
Recently, this team of young Alibaba engineers and Kelly proposed a high-precision congestion control (HPCC) algorithm, a new approach to congestion control in high-performance networks, which is currently an immense challenge for the industry. The results of their research were presented at SIGCOMM 2019. The algorithm is another notable research breakthrough for Alibaba Group.
Kelly and the young team of engineers first met at the Alibaba Cloud Apsara Conference mathematics competition. A world-renowned mathematician and theorist, Kelly is known for successfully modeling and explaining the congestion control used in the Transmission Control Protocol (TCP). He applied economic theory to analyze the convergence and fairness of the protocol, and also demonstrated its stability and effectiveness theoretically. Not long afterward, Zhang Ming (now an Alibaba Cloud Intelligence researcher) was pursuing a PhD at Princeton University. Kelly’s work is among the classic, highly regarded papers in Zhang’s research field. Many of Kelly’s viewpoints greatly inspired Zhang, and this influence can be seen in Zhang’s work on network research.
Frank Kelly with the authors of the Alibaba HPCC papers. Zhang Ming, in a black knitted sweater, is standing immediately to the right of Kelly.
With the rapid development of cloud computing, performance-driven network architectures that push performance higher than ever are becoming mainstream in large-scale data centers. With these changes, however, the TCP congestion control that Kelly first analyzed is being increasingly challenged. In fact, this issue is one of the core challenges in data-center network design today. So who would be better placed to find a solution than Kelly himself?
Kelly was invited to attend the opening ceremony of Alibaba Cloud’s Apsara Conference mathematics competition, held in Hangzhou in September 2018. At the competition, and through exchanges with Zhang Ming, Kelly learned about the high-speed network congestion control challenges that Alibaba was facing as Alibaba Group rolled out its next-generation network infrastructure designs.
Frank Kelly communicating with the group of Alibaba engineers, discussing the intricacies of congestion control.
As an early researcher on congestion control, Professor Frank Kelly has shown great and continuing enthusiasm for the subject over the years. Combining Professor Kelly’s keen sense for network theory with the rich practical experience of Zhang Ming and his team in building high-performance data center networks, they arrived at the same bold conjecture: the stability of the current mainstream congestion control algorithms for high-speed networks probably cannot be proved theoretically, and this is the root cause of a series of engineering problems associated with these network architectures.
With such a rapport from the start, Kelly and Zhang’s team decided to work together over the long term to solve the congestion control problem in large-scale, high-performance networks. Over months of frequent phone conversations between Alibaba’s Seattle office and the University of Cambridge, the two sides shared many ideas.
After more than four months of long-distance cooperation, the two sides designed a new high-precision congestion control (HPCC) algorithm. Unlike current mainstream congestion control algorithms, HPCC not only guarantees stability, efficiency, and fairness in theory, but also keeps network latency under congestion dozens to hundreds of times lower than mainstream algorithms do.
When developing the algorithm, the team and Kelly had to consider the following reality:
Computers around the world run very different operating systems and therefore represent the same information in very different ways. A standard network protocol is needed to connect these different computers so that they can communicate with one another. This is why TCP came into being.
Like a unified road laid across the computer world, TCP makes real-time transmission of information possible. In a real-world traffic network, wide roads without traffic lights are not enough; good traffic rules and traffic control and guidance systems are also necessary. Similarly, in the computer and Internet world, network bandwidth is limited. The team of Alibaba engineers and Professor Kelly had to design a new congestion control algorithm that, like a real-world system of “traffic lights,” keeps the traffic from each server both manageable and fair. Their algorithm neither lets the network freeze when bandwidth is scarce nor allows any one party to gain excessive bandwidth and block other traffic.
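To make the “traffic light” idea concrete, here is a minimal sketch of how a congestion control rule can steer competing senders toward a fair share of a limited link. It is not the HPCC algorithm, whose details this article does not describe. It uses the classic additive-increase/multiplicative-decrease (AIMD) rule as a stand-in, and the function names, starting rates, link capacity, and step sizes are all assumptions chosen purely for illustration.

```python
def simulate_aimd(rates, capacity, rounds, increase=1.0, decrease=0.5):
    """Simulate synchronized AIMD flows sharing one link (illustrative only).

    rates:    starting sending rates, one per flow (hypothetical units)
    capacity: link capacity in the same units
    """
    rates = list(rates)  # copy so the caller's list is not modified
    for _ in range(rounds):
        if sum(rates) > capacity:
            # Link is overloaded: every flow backs off multiplicatively.
            rates = [r * decrease for r in rates]
        else:
            # Link has headroom: every flow probes for more bandwidth.
            rates = [r + increase for r in rates]
    return rates


def jain_fairness(rates):
    """Jain's fairness index: 1.0 means all flows get an equal share."""
    total = sum(rates)
    squares = sum(r * r for r in rates)
    return total * total / (len(rates) * squares)


if __name__ == "__main__":
    start = [90.0, 5.0, 1.0]  # one greedy flow and two small ones
    final = simulate_aimd(start, capacity=100.0, rounds=500)
    print("fairness before:", round(jain_fairness(start), 3))
    print("fairness after: ", round(jain_fairness(final), 3))
    print("final rates:    ", [round(r, 2) for r in final])
```

In this toy model, each multiplicative back-off halves the gap between flows while additive increase leaves it unchanged, so the rates drift toward an equal split and the total load never overshoots the link by more than one round of increases. Real data-center algorithms work with much richer feedback than this single overloaded-or-not signal.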
At Alibaba, engineers are imbued with a strong sense of both idealism and realism, so technological innovation cannot remain purely theoretical or stuck in the laboratory; it has to reach real users. Accordingly, even at the initial stages of development, the Alibaba Network Team began testing the new algorithm developed by Kelly and the young group of engineers on software and hardware that simulated the real production environment.
With an elaborate software and hardware design and over 40,000 lines of code, the Alibaba Network Team implemented the HPCC algorithm and a related protocol stack prototype in just two months. Test results show that the HPCC algorithm can be implemented efficiently on existing hardware and that its behavior is highly consistent with the accompanying theoretical analysis. In other words, the HPCC algorithm has opened a new research direction for the next generation of high-performance network congestion control and will also have a long-term and profound impact on the design and operation of cloud computing networks.
Investment in and support for basic research should not stop at the theoretical level; it also needs to be closely integrated with front-line engineers and application scenarios. Perhaps this is why Alibaba established the DAMO Academy to support basic research. As Zhang Jianfeng, president of Alibaba Cloud Intelligence and president of the DAMO Academy, said at the Alibaba Cloud Summit in 2019: “The scientific research strength of the whole of Alibaba Group will be integrated, and the capabilities of the DAMO Academy will be fully integrated with the cloud. And in the future, we will invest more R&D to expand the technical generation advantages of the cloud.”
With Alibaba’s growing technological strength and its emphasis on basic research, the company’s technological journey looks to have a bright future.