Alibaba Builds High-Speed RDMA Network for AI and Scientific Computing

Alibaba Cloud
Jun 13, 2019


Many cloud computing providers have deployed RDMA (Remote Direct Memory Access) networks in their data centers, but Alibaba has taken the lead in the scale of its deployment. Dozens of its data centers currently support RDMA networking, which reduces latency by 90% and meets the requirements of scenarios such as artificial intelligence and scientific computing.

Beijing Winter Olympics Cloud Data Center at Alibaba Cloud

Alibaba Cloud products such as the high-performance cloud disk ESSD, the cloud-native database POLARDB, Super Computing Cluster (SCC), and PAI run on the RDMA network. These popular products all benefit from the advances in network technology.

RDMA is currently the most popular high-performance network technology in the industry. It can significantly reduce data transmission time and is considered key to increasing AI and supercomputing efficiency. Statistics show that, without an RDMA network, each task iteration in speech recognition training takes between 650 ms and 700 ms, of which 400 ms is communication latency.
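To put those figures in perspective, the short Python sketch below works out what share of each iteration goes to communication. The projection step is an assumption for illustration only: it applies the article's 90% latency reduction directly to the 400 ms communication time and leaves compute time unchanged, which the source does not state. The function names are hypothetical.

```python
def communication_share(iteration_ms: float, comm_ms: float) -> float:
    """Fraction of one training iteration spent on communication."""
    return comm_ms / iteration_ms


def iteration_after_reduction(iteration_ms: float, comm_ms: float, reduction: float) -> float:
    """Projected iteration time if only the communication part shrinks.

    Assumption: compute time stays the same and communication time
    is cut by `reduction` (0.90 means a 90% cut).
    """
    compute_ms = iteration_ms - comm_ms
    return compute_ms + comm_ms * (1.0 - reduction)


COMM_MS = 400.0            # communication latency quoted in the article
ASSUMED_REDUCTION = 0.90   # assumption: the 90% figure applies to this 400 ms

for iteration_ms in (650.0, 700.0):
    share = communication_share(iteration_ms, COMM_MS)
    projected = iteration_after_reduction(iteration_ms, COMM_MS, ASSUMED_REDUCTION)
    print(f"{iteration_ms:.0f} ms iteration: {share:.1%} communication, "
          f"projected {projected:.0f} ms under the assumed reduction")
```

With the quoted numbers, communication accounts for roughly 57% to 62% of each iteration, which is why the network is the natural place to look for speedups.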

To improve data transmission speed and meet user needs, leading cloud providers such as Amazon and Microsoft have begun to focus on the R&D and deployment of this technology. However, few enterprises have deployed RDMA at large scale in their data centers.

In 2016, Alibaba launched a dedicated research project to rework RDMA and improve transmission performance. Starting from the underlying layer of network interface controllers, Alibaba designed networks that can support large-scale deployment and combined them with its own vSwitch to maximize performance. Ultimately, Alibaba built the high-speed network in the largest data centers in the world, eliminating transmission speed bottlenecks in clusters and reducing latency by 90%.

Take the 2018 Tmall Double 11 event, for example: the RDMA-based cloud storage and e-commerce database servers easily handled the heavy traffic during peak business hours. SAIC Motor uses RDMA-backed SCC to run simulations and has improved overall efficiency by 25%.

“RDMA has become essential for high-performance and storage services such as AI and scientific computing. In the future, we will continue to explore network technologies that enable higher bandwidth and deploy a 100G high-speed network to provide enterprises with highly stable and low-latency network services,” said Cai Dezhong, Chief Network Architect at Alibaba.

As a cloud service provider ranking 1st in China and top 3 in the world, Alibaba Cloud currently has 56 availability zones in 19 global regions. The total network bandwidth has reached the PB level. Currently, Alibaba Cloud is working on the R&D of the 400G network. The 400G QSFP-DD industry standard put forward by Alibaba Cloud has been widely recognized by global enterprises.

Reference

https://www.alibabacloud.com/blog/alibaba-builds-high-speed-rdma-network-for-ai-and-scientific-computing_594895?spm=a2c41.13011829.0.0
