Zero-Copy Optimization for Alibaba Cloud Smart NIC Solution
By Chen Jing, Senior Technical Expert at Alibaba Cloud
In a VPC deployment, the virtual switch performs encapsulation and decapsulation between the overlay and underlay network protocols. It is also responsible for Layer 2 and Layer 3 routing, QoS, traffic limiting, and security groups on multi-tenant (virtual machine or container) hosts. The open source Open vSwitch (OVS) project is widely accepted in the industry. At Alibaba Cloud, we have also developed Application Virtual Switch (AVS), which is tailored to our business. AVS has functions similar to those of OVS and is likewise a pure software implementation. In production use we ran into two problems with it, which led to the Smart NIC project.
First, AVS occupies host resources. As software, AVS consumes computing and memory resources on the host. When throughput is high, it also occupies a large share of the last-level cache (LLC). These resources would be better given to customers, which would increase resource utilization.
Second, AVS has a performance bottleneck. In cloud deployments, a pure software implementation cannot keep up with applications' growing demand for network bandwidth, so AVS needs to be accelerated.
For these two reasons, we started the Smart NIC project.
Virtio or SR-IOV?
It is well known that VF devices based on PCI-E SR-IOV are presented to the VM by pass-through, which gives high performance and rich functionality. SR-IOV also has obvious drawbacks: hot-plug is inconvenient, VF interfaces differ between manufacturers, and additional drivers are required in the guest. Virtio, a relatively common virtual device interface in the industry, solves these problems. However, no hardware manufacturer yet fully supports the Virtio interface in hardware, so Virtio devices are emulated in software and their performance is not satisfactory. So, is there a solution that has high performance and supports Virtio too?
We designed a soft forwarding–based solution combining the advantages of Virtio and SR-IOV.
Smart NIC Solution
The figure below illustrates the framework design of our Smart NIC. There is a standard NIC ASIC and an SoC. The SoC has built-in memory and CPU. AVS runs on the SoC, offloading the fast-path to the ASIC, and is only responsible for the slow-path processing. The separation of the fast- and slow-paths not only reduces the processing load of AVS, but also greatly improves the throughput.
In our Smart NIC implementation, the standard card cannot implement virtio-net in hardware, so the only interface it exposes to the host is a vendor-specific VF. On the guest side (VM), however, we expect users to keep using the virtio-net interface without any modifications. To meet this requirement, we developed a simple DPDK-based soft forwarding program. Because AVS handles all policy, logic, routing, and multi-queue functions on the NIC, there is a one-to-one correspondence between a VF and a virtio-net frontend, and the soft forwarding program only needs to convert between the VF and virtio-net interfaces and deliver the messages. At the same time, we cannot allocate many host resources to the soft forwarding program, because that would run counter to our original intention in developing Smart NIC. The performance of soft forwarding therefore becomes a key indicator. Before describing our optimization, let's take a look at the regular path.
Message Receiving Path
Before describing the message receiving path, it is necessary to introduce how a receiving queue is initialized in DPDK. First, the software allocates a buffer in memory to serve as the queue. Each entry in the queue is referred to as a descriptor, which is the interface between software and hardware, or between the frontend and backend drivers in the case of a para-virtualized emulated device such as virtio-net. In addition, the software needs a memory pool, which contains fixed-size data buffers that hold the network messages received through the NIC.
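The initialization described above can be sketched as a small Python model. This is an illustrative sketch only, not DPDK code: the names RxDescriptor, MemPool, and init_rx_queue are hypothetical, and buffers are identified by a pool index rather than a real DMA address.

```python
from dataclasses import dataclass


@dataclass
class RxDescriptor:
    """One ring entry shared by software and the device (hypothetical layout)."""
    buf_addr: int = 0   # which data buffer the device may write into
    length: int = 0     # filled in by the device when a message arrives
    done: bool = False  # set by the device on write-back


class MemPool:
    """Pool of fixed-size data buffers, addressed here by list index."""

    def __init__(self, count: int, buf_size: int) -> None:
        self.buffers = [bytearray(buf_size) for _ in range(count)]
        self.free = list(range(count))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, index: int) -> None:
        self.free.append(index)


def init_rx_queue(ring_size: int, pool: MemPool) -> list:
    """Allocate the descriptor ring and post one empty buffer per descriptor."""
    return [RxDescriptor(buf_addr=pool.alloc()) for _ in range(ring_size)]


pool = MemPool(count=8, buf_size=2048)
ring = init_rx_queue(ring_size=4, pool=pool)
```

After initialization every descriptor points at its own empty buffer, and the remaining pool buffers are available for refilling the ring as messages are consumed.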
After the receiving queue is initialized, the NIC can work properly. When the message is received from the network, it is processed as shown in the figure below, and finally sent to the corresponding virtual machine.
First, the message is buffered on the Smart NIC and processed by either the fast path or the slow path. This processing determines which VF and which receiving queue the message will be sent to.
Then, the NIC reads the descriptor from the corresponding VF receiving queue, obtains the address where the message is stored, then performs a DMA operation, and finally, writes the descriptor back to the receiving queue. The descriptor contains information such as the length, type, and error flag bit of the message.
The soft forwarding program detects from the descriptor of the VF receiving queue that a message has arrived. It then reads the receiving queue descriptor of the virtio-net device in the VM, obtains the address of the guest buffer that the message must be copied to, and copies the message with memcpy. Finally, it updates the virtio-net descriptor information and notifies the frontend driver, which completes the process.
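The regular path can be modeled in a few lines of Python to make the two copies visible. This is a simplified sketch under stated assumptions: descriptors are plain dictionaries with hypothetical field names, guest memory is a bytearray indexed by guest address, and the VF and virtio-net rings use the same slot index.

```python
def nic_receive(vf_ring, host_bufs, slot, packet):
    """Model the NIC: DMA into the host buffer (copy 1), then write back the descriptor."""
    desc = vf_ring[slot]
    host_bufs[desc["buf"]][:len(packet)] = packet
    desc["len"] = len(packet)
    desc["done"] = True


def soft_forward_copy(vf_ring, host_bufs, virtio_ring, guest_mem, slot):
    """Model the soft forwarding program on the regular (copying) path."""
    vf_desc = vf_ring[slot]
    if not vf_desc["done"]:
        return 0                      # nothing received in this slot yet
    n = vf_desc["len"]
    dst = virtio_ring[slot]["addr"]   # guest buffer posted by the frontend
    # Copy 2: the memcpy from host memory into guest memory.
    guest_mem[dst:dst + n] = host_bufs[vf_desc["buf"]][:n]
    virtio_ring[slot]["len"] = n
    virtio_ring[slot]["used"] = True  # stands in for notifying the frontend
    vf_desc["done"] = False           # slot can be reposted to the NIC
    return n


host_bufs = [bytearray(64)]
vf_ring = [{"buf": 0, "len": 0, "done": False}]
virtio_ring = [{"addr": 16, "len": 0, "used": False}]
guest_mem = bytearray(128)

nic_receive(vf_ring, host_bufs, 0, b"hello")
copied = soft_forward_copy(vf_ring, host_bufs, virtio_ring, guest_mem, 0)
```

The slice assignment in soft_forward_copy is the per-message CPU cost that the rest of the article sets out to remove.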
Optimized Zero-Copy Receiving Method
As can be seen from the above process, the message is copied twice on its way to the virtual machine. The first copy is the hardware DMA into host memory, and the second is the copy by the soft forwarding program from host memory into the virtual machine's guest memory. The second copy consumes a lot of CPU time, and our subsequent performance profiling confirmed that memory copies account for a large proportion of the overhead. So, is it possible to eliminate one copy?
Based on the above requirements, we have designed a zero-copy solution for the receiver. A queue synchronization module is designed as shown below.
Its main functions are as follows:
- Establishing a 1:1 mapping between the descriptors of the frontend virtio-net queue and those of the SR-IOV VF queue.
- Monitoring changes to the frontend virtio-net receiving queue descriptors and synchronizing them to the receiving queue of the VF. As a result, the VF receiving queue no longer needs its own memory pool to buffer messages received from the NIC; it directly uses the buffers passed in by the virtual machine.
- Converting the guest physical address (GPA) to the host physical address (HPA). The virtio-net frontend driver fills GPAs into its descriptors. When no IOMMU is used, each GPA must be converted into an HPA so that the Smart NIC can perform the DMA.
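A minimal sketch of these three functions follows. It assumes, purely for illustration, that guest memory is described by a table of regions that are each contiguous in host physical memory (for example, backed by huge pages); MemRegion, gpa_to_hpa, and sync_rx_queue are hypothetical names, not part of AVS or DPDK.

```python
from typing import NamedTuple, Optional


class MemRegion(NamedTuple):
    gpa_start: int  # first guest physical address of the region
    size: int       # region length in bytes
    hpa_start: int  # host physical address backing gpa_start


def gpa_to_hpa(regions, gpa: int, length: int) -> Optional[int]:
    """Translate a guest buffer to an HPA; None if it is not fully inside one region."""
    for r in regions:
        if r.gpa_start <= gpa and gpa + length <= r.gpa_start + r.size:
            return r.hpa_start + (gpa - r.gpa_start)
    return None


def sync_rx_queue(regions, virtio_avail, vf_ring) -> int:
    """Mirror guest-posted buffers into the VF ring, keeping the 1:1 slot mapping."""
    posted = 0
    for slot, (gpa, length) in virtio_avail.items():
        hpa = gpa_to_hpa(regions, gpa, length)
        if hpa is None:
            continue  # unmapped buffer: never hand it to the NIC
        vf_ring[slot] = {"addr": hpa, "len": length, "done": False}
        posted += 1
    return posted


regions = [MemRegion(0x0, 0x1000, 0x80000),
           MemRegion(0x100000, 0x2000, 0x40000)]
virtio_avail = {0: (0x10, 0x800), 1: (0x500000, 0x800)}
vf_ring = {}
posted = sync_rx_queue(regions, virtio_avail, vf_ring)
```

Rejecting buffers that fall outside a single region matters here, because the NIC writes to whatever address the VF descriptor contains.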
With the queue synchronization module, the message receiving process is as shown in the following figure.
The major improvement is that the Smart NIC performs DMA directly into the memory specified by the virtual machine, and the hardware does not perceive this change. The soft forwarding program simply converts the descriptor information from the VF receiving queue and fills it into the receiving queue of the frontend virtio-net.
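The zero-copy completion step can be sketched in the same simplified Python model. The assumptions are again illustrative: a single bytearray stands in for physical memory, the VF descriptor already holds the translated guest buffer address, and the {"id", "len"} entry mirrors the shape of a virtio used-ring element. The function names are hypothetical.

```python
def nic_receive_zero_copy(vf_ring, memory, slot, packet):
    """Model the NIC: the only copy is the DMA, and it lands in the guest's buffer."""
    desc = vf_ring[slot]
    addr = desc["addr"]
    memory[addr:addr + len(packet)] = packet
    desc["len"] = len(packet)
    desc["done"] = True


def soft_forward_zero_copy(vf_ring, virtio_used, slot) -> bool:
    """Convert the VF write-back descriptor into a virtio-net used entry; no memcpy."""
    vf_desc = vf_ring[slot]
    if not vf_desc["done"]:
        return False
    virtio_used.append({"id": slot, "len": vf_desc["len"]})
    vf_desc["done"] = False
    return True


memory = bytearray(256)
vf_ring = [{"addr": 32, "len": 0, "done": False}]
virtio_used = []

before = soft_forward_zero_copy(vf_ring, virtio_used, 0)
nic_receive_zero_copy(vf_ring, memory, 0, b"ping")
after = soft_forward_zero_copy(vf_ring, virtio_used, 0)
```

Compared with the regular path, the forwarding function touches only descriptor metadata, so its cost no longer depends on the message length.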
In this article, we introduced the motivation, functional framework, and soft forwarding program of Smart NIC developed by Alibaba Cloud, as well as problems found during the soft forwarding process and corresponding optimization methods.
Alibaba Cloud has designed a zero-copy solution for soft forwarding in the Smart NIC project. According to our tests and analysis, zero-copy has the following advantages:
- Reduces memory copy at the receiver.
- Reduces the memory footprint, avoiding cache-coherency overhead.
- Improves overall performance by 40%.
- Reduces the time overhead of copying.