Evolution of the Alibaba Cloud Pangu Distributed Storage Technology
In recent years, with the development of storage technologies such as NVMe flash storage, the I/O bandwidth of a single storage device has reached the GB/s level, and latency has dropped to the microsecond range. Next-generation memory-class non-volatile storage media, such as Intel 3D XPoint, will further improve storage performance. In general, the data center has transitioned from the millisecond era of the traditional mechanical hard disk to the microsecond era of new storage media. While performance has improved significantly, per-node storage capacity has also grown rapidly, to dozens of terabytes. The rapid growth in node storage performance and capacity places higher demands on the performance of inter-node network communication in distributed storage systems. Therefore, high-performance network technologies such as remote direct memory access (RDMA) have started to be deployed in data centers. RDMA originated in the high-performance computing field (for example, the TaihuLight supercomputer) and achieves highly efficient node-to-node communication through customized flow control mechanisms.
The application of these new storage and network technologies is transforming the underlying physical architecture of the data center. Master Lü's Spring and Autumn Annals contains a famous saying: "as the time and environment change, laws should change as well." The sentence underscores the importance of adapting to change, a principle the Alibaba Cloud team has fully embraced.
The Pangu distributed storage system originated in the Alibaba Cloud Apsara system and has been in production for more than 10 years. Developed in-house by Alibaba Cloud, it is one of the company's core components. Pangu supports multiple key storage services, such as Alibaba Cloud OSS, MaxCompute, Block Storage, and Network Attached Storage. Over years of development, Pangu has become the storage platform of the entire Alibaba Group, supporting Tmall, Taobao, Alipay, and more. The next-generation Pangu storage system fully utilizes new technologies such as NVMe and RDMA to deliver high-performance storage services. Its end-to-end triplicate write latency is under 30 microseconds, approaching the theoretical physical latency of the underlying hardware. The Pangu-based ESSD cloud disk achieves 100 µs latency and 1 million IOPS.
In this year's Alibaba 6.18 shopping festival, Pangu served as the underlying storage system for Alibaba's core businesses, such as the Tmall and Taobao e-commerce databases and Alibaba Cloud Block Storage. Before that, no enterprise in China or abroad had applied the new RDMA and NVMe technologies to core services such as large-scale online databases and cloud block storage. Pangu is the first to bring these technologies to large-scale production use in online core services.
Figure 1: Pangu distributed storage system
The exceptional performance of the Pangu storage system comes from pushing storage and network performance to their limits, particularly through R&D on RDMA. Pangu chose RDMA for two reasons: performance and semantics. In terms of performance, RDMA outperforms traditional TCP communication in both latency and CPU usage because the RDMA network adapter processes the communication protocol in hardware. The point-to-point latency of RDMA is close to 1 µs, while under the same conditions TCP point-to-point latency is 200 µs or more. RDMA can saturate the network bandwidth with a single CPU core, whereas TCP requires at least four CPU cores for the same purpose; the CPU efficiency of RDMA is therefore significantly higher than that of TCP. In terms of semantics, RDMA provides a communication mechanism that reliably delivers data from node A to node B and implements remote memory access semantics. Reliable transmission combined with memory semantics enables remote processing units, such as CPUs, FPGAs, and GPUs, to operate directly on data within an addressable memory range. In contrast, because TCP uses byte-stream semantics, it is difficult for the receiver to determine message boundaries and therefore difficult to process data directly; a processing unit must first parse the stream. With the emergence of large-scale high-performance devices, such as new storage media like Intel AEP and dedicated co-processing chips, it is increasingly important for the Pangu system to process data directly, which makes this advantage of RDMA even more prominent. The Pangu distributed storage system fully exploits RDMA through a full user-mode software stack: the end-to-end overhead of the Pangu software library is below 3 µs, enabling highly efficient I/O performance.
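The byte-stream framing problem mentioned above can be illustrated with a minimal sketch. A TCP receiver may get several messages fused together (or one message split across reads), so the application must define its own framing; the length-prefix protocol below is purely illustrative, not Pangu's actual wire format:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a message with its 4-byte big-endian length."""
    return struct.pack(">I", len(payload)) + payload

def parse_stream(stream: bytes):
    """Recover message boundaries from a raw byte stream.

    TCP delivers an unstructured sequence of bytes, so the receiver
    must parse the length headers itself to find where each message
    ends. RDMA message semantics make this step unnecessary.
    """
    messages = []
    offset = 0
    while offset + 4 <= len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        if offset + 4 + length > len(stream):
            break  # incomplete trailing message: wait for more bytes
        messages.append(stream[offset + 4 : offset + 4 + length])
        offset += 4 + length
    # Return parsed messages plus any unconsumed leftover bytes.
    return messages, stream[offset:]
```

For example, two messages that arrive concatenated in one read, `frame(b"hello") + frame(b"world")`, are only separable because the receiver spends CPU cycles parsing the headers; with RDMA, each message lands in the remote memory region as a complete unit.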
Figure 2: Pangu full user-mode software stack
RDMA network performance is outstanding. In practice, however, due to factors such as cost, data centers widely deploy the RoCE (RDMA over Converged Ethernet) technology rather than the InfiniBand RDMA used in high-performance computing. RoCE achieves lossless communication by layering flow control on top of an Ethernet fabric that inherently allows packet loss. Building lossless communication on a lossy network is itself risky: compared with earlier implementations that tolerated packet loss, this approach is prone to systemic network failures. This remains one of the toughest issues in deploying RoCE in data centers worldwide.
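The flow-control idea behind this lossless layering can be sketched as a toy model: instead of dropping packets when its buffer fills, the receiver pushes back on the sender with pause/resume signals. The watermark values and single-queue model below are illustrative assumptions; real deployments apply such pause frames per priority class at the Ethernet layer, with vendor-specific thresholds:

```python
from collections import deque

# Illustrative watermarks (assumptions, not actual switch settings).
XOFF = 8  # high watermark: ask the sender to pause
XON = 4   # low watermark: allow the sender to resume

class Receiver:
    """A receive buffer that emits pause/resume signals instead of
    dropping packets when it fills up."""
    def __init__(self):
        self.buf = deque()
        self.paused = False
        self.max_buf = 0  # track peak occupancy

    def enqueue(self, pkt):
        self.buf.append(pkt)
        self.max_buf = max(self.max_buf, len(self.buf))
        if len(self.buf) >= XOFF:
            self.paused = True  # signal the sender to stop

    def drain(self):
        if self.buf:
            self.buf.popleft()
        if self.paused and len(self.buf) <= XON:
            self.paused = False  # signal the sender to resume

def run(packets, drain_every=3):
    """Sender transmits only while not paused; the receiver drains one
    packet every `drain_every` ticks. No packet is ever dropped."""
    rx = Receiver()
    sent = tick = 0
    while sent < packets:
        if not rx.paused:
            rx.enqueue(sent)
            sent += 1
        tick += 1
        if tick % drain_every == 0:
            rx.drain()
    return sent, rx.max_buf
```

Running `run(50)` delivers all 50 packets with the buffer never exceeding the high watermark. The sketch also hints at the systemic risk the text describes: while paused, the sender's own queue backs up, and in a multi-hop fabric that backpressure can cascade upstream.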
If RDMA is compared to an expressway, TCP is like a provincial road. The expressway achieves high transport efficiency through physical isolation (a dedicated, separated roadway) and strict traffic rules. The provincial road, in contrast, must connect many places of departure and destination; it also pursues efficiency, but its traffic rules are less strict because of compromises such as cost. Because the expressway is isolated and carries high-speed traffic, its risks in snowy or foggy weather are markedly higher than those of the provincial road.
The same is true for RDMA: its high performance comes with significant risks. For RoCE, the industry's technical experience with RDMA-capable network adapters and switches from different vendors is still accumulating, and many problems remain in flow control policies and parameter settings. The network adapters and switches that carry RoCE traffic therefore pose substantial risks. Despite these risks, the services supported by Pangu, such as e-commerce and Alibaba Cloud storage, must run stably around the clock all year round, so Pangu must be able to eliminate the risks and run stably. In practice, the Pangu and Alibaba network teams adopt a software-hardware co-design approach to minimize the risks of RoCE while preserving its performance.
Through the relentless pursuit of RDMA performance and software-hardware co-design for reliability, Pangu achieves efficient and stable RDMA-based operation, and was deployed for the first time in key services such as databases and Alibaba Cloud Block Storage during the Alibaba 6.18 shopping festival. In addition, a series of R&D efforts are under way on RDMA QoS, network-storage convergence, and RDMA-based near-line storage computing. Going forward, Pangu will support more Alibaba services and will be further tested and promoted in the coming 11.11 (Singles' Day) shopping festival, providing efficient and stable storage services for users.