How Does Alibaba Cloud Build High-Performance Cloud-Native Pod Networks in Production Environments?

By Xiheng, Alibaba Cloud Technical Expert

On April 16, we livestreamed the second SIG Cloud-Provider-Alibaba seminar. This livestream demonstrated how Alibaba Cloud designs and builds high-performance pod networks in cloud-native environments. This article recaps the content in the livestream, provides download links, and answers questions asked during the livestream. I hope it will be useful to you.

First, I will introduce Cloud Provider SIG. This is a Kubernetes cloud provider group that is dedicated to making the Kubernetes ecosystem neutral for all cloud providers. It coordinates different cloud providers and tries to use a unified standard to meet developers’ requirements. As the first cloud provider in China that joined Cloud Provider SIG, Alibaba Cloud also promotes Kubernetes standardization. We coordinate on technical matters with other cloud providers, such as AWS, Google, and Azure, to optimize the connections between clouds and Kubernetes and unify modular and standard protocols of different components. We welcome you to join us in our efforts.

With the increasing popularity of cloud-native computing, more and more application workloads are deployed on Kubernetes. Kubernetes has become the cornerstone of cloud-native computing and a new interface between users and cloud computing. The network is a basic dependency of every application, so it is both an essential component of cloud-native applications and the biggest concern of many developers as they move to cloud-native. Users must deal with many network issues. For example, the pod network is not on the same plane as the original machine network, an overlay pod network loses performance to packet encapsulation, and Kubernetes load balancing and service discovery are not sufficiently scalable. So, how do we build a cluster pod network?

This article will describe how Alibaba Cloud designs and builds high-performance, cloud-native pod networks in cloud-native environments.

The article is divided into three parts:

  • Kubernetes Pod Network Overview
  • Building a High-Performance, Cloud-Native Pod Network
  • Enhancing Network Scalability and Performance

Kubernetes Pod Network Overview

First, this article introduces the basic concepts of the Kubernetes pod network:

  • Pod Network Connectivity (CNI)
  • Kubernetes Load Balancing (Service)
  • Kubernetes Service Discovery (CoreDNS)

The following figure shows the Kubernetes pod network.

Image for post

Pod network connectivity (CNI) involves the following factors:

  • Each pod has its own network namespace and IP address, so applications in different pods can listen on the same port without conflicts.
  • Pods can reach each other, nodes, and external networks using their own IP addresses. This covers communication between pods, between pods and nodes, and between pods and external networks.

To achieve these network capabilities, address assignment and network connectivity are required. These capabilities are implemented with CNI network plugins.

Container Network Interface (CNI) defines a set of APIs that network plugins implement so that Kubernetes can configure pod networks. Common CNI plugins include Terway, Flannel, and Calico.

When we create a pod:

  1. Kubelet watches the API server for pod creation and creates a pod sandbox.
  2. Kubelet then calls the configured CNI plugin through the CNI interface to configure the pod network.
  3. The CNI plugin configures the pod's network namespace and connects it so that different pods can reach each other.
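
As a rough sketch of step 2, the CNI specification has the runtime pass parameters to the plugin as environment variables and the network configuration as JSON on standard input. The snippet below only builds that request and does not execute a plugin. The plugin name `example-cni`, the network name, and the paths are placeholders chosen for illustration.

```python
import json


def build_cni_add_request(container_id, netns_path, plugin_type, ifname="eth0"):
    """Build the environment and stdin payload for a CNI ADD call (not executed here)."""
    env = {
        "CNI_COMMAND": "ADD",             # operation: add the sandbox to the network
        "CNI_CONTAINERID": container_id,  # ID of the pod sandbox
        "CNI_NETNS": netns_path,          # path to the pod's network namespace
        "CNI_IFNAME": ifname,             # interface name to create inside the pod
        "CNI_PATH": "/opt/cni/bin",       # plugin search path (assumed location)
    }
    network_config = {
        "cniVersion": "0.4.0",
        "name": "example-pod-network",    # hypothetical network name
        "type": plugin_type,              # name of the plugin binary to run
    }
    return env, json.dumps(network_config)


env, stdin_payload = build_cni_add_request(
    "sandbox-123", "/var/run/netns/sandbox-123", "example-cni"
)
```

A real runtime would then execute the plugin binary with `env` and write `stdin_payload` to its standard input.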

Typically, pods are not in the same plane as the host network, so how can the pods communicate with each other? Generally, the following two solutions are used to connect pods:

  • Packet encapsulation: Communication packets between pods are encapsulated into packets between hosts.
  • Routing: Communication packets between pods are forwarded to corresponding nodes through the routing table.
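
The difference between the two approaches can be sketched in a few lines. This is a simplified model rather than any plugin's actual data path, and the CIDR blocks and node addresses are made up for illustration.

```python
import ipaddress

# Hypothetical routing table: pod CIDR block -> IP address of the node hosting it.
ROUTES = {
    ipaddress.ip_network("172.16.1.0/24"): "192.168.0.11",
    ipaddress.ip_network("172.16.2.0/24"): "192.168.0.12",
}


def next_hop_for(pod_ip):
    """Routing: look up which node a destination pod IP should be forwarded to."""
    addr = ipaddress.ip_address(pod_ip)
    for cidr, node_ip in ROUTES.items():
        if addr in cidr:
            return node_ip
    return None


def encapsulate(inner_packet, src_node, dst_node):
    """Encapsulation: wrap the pod-to-pod packet in a node-to-node packet."""
    return {"outer_src": src_node, "outer_dst": dst_node, "payload": inner_packet}


pod_packet = {"src": "172.16.1.5", "dst": "172.16.2.7"}
wrapped = encapsulate(pod_packet, "192.168.0.11", next_hop_for(pod_packet["dst"]))
```

With routing, the pod packet travels unchanged and only the route entries grow with the cluster. With encapsulation, every packet carries an extra outer header, which is where the encapsulation overhead comes from.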

We need Kubernetes Service for the following reasons:

  • Pods have a short lifecycle and changeable IP addresses, so they need a fixed access method.
  • Pod groups in a Deployment need a unified access entry and require load balancing.

A Kubernetes Service meets both needs:

  • When a Kubernetes Service object is created, it is assigned a fixed Service IP address.
  • A labelSelector selects a group of pods, and traffic sent to the Service IP address and port is load balanced across the IP addresses and ports of the selected pods.
  • This ensures a fixed access entry and load balancing for the group of pods.
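
The selection and balancing logic can be illustrated with a minimal sketch. The pod data is invented, and random choice stands in for whatever balancing strategy the data plane actually applies.

```python
import random


def select_endpoints(pods, selector):
    """Return (ip, port) for every pod whose labels match all selector entries."""
    return [
        (pod["ip"], pod["port"])
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


def pick_backend(endpoints, rng=random):
    """Pick one backend for a new connection to the Service IP address."""
    return rng.choice(endpoints)


# Hypothetical pods in a cluster.
PODS = [
    {"ip": "10.0.1.5", "port": 8080, "labels": {"app": "web"}},
    {"ip": "10.0.2.7", "port": 8080, "labels": {"app": "web"}},
    {"ip": "10.0.3.9", "port": 5432, "labels": {"app": "db"}},
]

web_endpoints = select_endpoints(PODS, {"app": "web"})
backend = pick_backend(web_endpoints)
```

When a pod is rebuilt with a new IP address, only the endpoint list changes. Clients keep using the same Service IP address.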

Kubernetes service discovery (CoreDNS) builds on the Service:

  • Even though the Service provides a fixed IP address access method, the Service has different IP addresses in different namespaces or clusters. In this case, how do we unify the access entry?
  • CoreDNS in the cluster automatically converts Service names into Service IP addresses to ensure the same access entry is implemented in different deployment environments.
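
A small sketch shows why the name is the stable entry point. It assumes the conventional `<service>.<namespace>.svc.cluster.local` naming scheme, and the record tables and addresses are invented.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# Hypothetical DNS records held by the cluster DNS in two different clusters.
STAGING_RECORDS = {"web.default.svc.cluster.local": "10.96.0.20"}
PRODUCTION_RECORDS = {"web.default.svc.cluster.local": "172.21.5.8"}


def resolve(fqdn, records):
    """Return the Service IP address for a name, or None if it is unknown."""
    return records.get(fqdn)


name = service_fqdn("web", "default")
staging_ip = resolve(name, STAGING_RECORDS)
production_ip = resolve(name, PRODUCTION_RECORDS)
```

The application is configured with the same name in both environments, while the Service IP address behind it differs.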

Building a High-Performance, Cloud-Native Pod Network

The cloud IaaS layer network is already virtualized. If network virtualization is further performed in pods, the performance loss is significant.

A cloud-native container network configures the pod network directly from native cloud resources.

  • Pods and nodes are in the same network plane and have equal network status.
  • The pod network can be seamlessly integrated with cloud products.
  • Without requiring packet encapsulation and routing, the pod network provides network performance equivalent to virtual machines.

The CNI plugin calls the cloud network's open APIs to allocate network resources.

  • Generally, these resources are elastic network interfaces (ENIs) and ENI secondary IP addresses bound to the node where the pod runs.
  • After the resources are allocated, the CNI plugin configures them in the pod sandbox on the host.

Pod networks become first-class citizens in a VPC. Therefore, using cloud-native pod networks has the following advantages:

  • Pods and virtual machines are at the same network layer, which facilitates cloud-native business migration.
  • Network devices allocated to pods can be used for communication without depending on packet encapsulation or routing.
  • The number of nodes in a cluster is not restricted by the quota of the routing table or encapsulation FDB table.
  • Overlay CIDR blocks do not need to be planned for pods. Pods in different clusters can communicate with each other as long as the security groups allow the traffic.
  • Pods can be mounted to the LoadBalancer backend without requiring port forwarding on nodes.
  • The NAT gateway can perform SNAT for pods, and SNAT does not need to be performed for pods on nodes. Pods use their IP addresses to access VPC resources, which facilitates auditing. Pods do not depend on conntrack (connection tracking) SNAT to access external networks, reducing the failure rate.

IaaS layer network resources (using Alibaba Cloud as an example):

  • ENIs: Virtualized ENIs at the IaaS layer can be dynamically allocated and bound to VMs. Typically, the number of ENIs that can be bound is restricted by PCI-E.
  • ENI secondary IP addresses: Typically, an ENI can be bound to dozens of VPC IP addresses as secondary IP addresses.

ENIs or ENI secondary IP addresses are allocated to pods to implement the cloud-native pod network.

There is a gap between cloud resource provisioning and rapid pod scaling that must be closed:

  • Pods are started in seconds. However, IaaS layer operations and calls typically require about 10 seconds.
  • Pods are frequently scaled in and out. However, the APIs of cloud products usually implement strict throttling.

Terway uses an embedded resource pool to cache resources and accelerate startup.

  • The resource pool records both the in-use and the idle resources allocated to pods.
  • After a pod is released, resources are retained in the resource pool to ensure a quick startup next time.
  • The resource pool also has minimum and maximum levels. When the number of idle resources is lower than the minimum level, APIs are called to supplement and pre-load resources to reduce the number of API calls needed when a large number of pods are created. When the number of idle resources is higher than the maximum level, APIs are called to release excess resources.
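
The pooling behavior described above can be sketched as follows. This is an illustrative model, not Terway's implementation. `FakeCloudAPI` and all names are hypothetical stand-ins for the throttled IaaS calls.

```python
class ResourcePool:
    """Caches network resources between a minimum and maximum number of idle entries."""

    def __init__(self, allocate, free, min_idle, max_idle):
        self._allocate = allocate  # slow, throttled cloud API call
        self._free = free          # slow, throttled cloud API call
        self.min_idle = min_idle
        self.max_idle = max_idle
        self.idle = []
        self.in_use = {}

    def acquire(self, pod_id):
        """Serve from the idle cache when possible, otherwise call the cloud API."""
        resource = self.idle.pop() if self.idle else self._allocate()
        self.in_use[pod_id] = resource
        return resource

    def release(self, pod_id):
        """Keep the resource for the next pod instead of freeing it right away."""
        self.idle.append(self.in_use.pop(pod_id))

    def reconcile(self):
        """Pre-load up to the minimum level and free anything above the maximum."""
        while len(self.idle) < self.min_idle:
            self.idle.append(self._allocate())
        while len(self.idle) > self.max_idle:
            self._free(self.idle.pop())


class FakeCloudAPI:
    """Hypothetical stand-in that counts API calls."""

    def __init__(self):
        self.calls = 0
        self._next = 0

    def allocate(self):
        self.calls += 1
        self._next += 1
        return f"10.0.0.{self._next}"

    def free(self, ip):
        self.calls += 1


api = FakeCloudAPI()
pool = ResourcePool(api.allocate, api.free, min_idle=2, max_idle=3)
pool.reconcile()                     # pre-loads two idle addresses
first_ip = pool.acquire("pod-a")     # served from the cache, no API call
calls_after_acquire = api.calls
```

In this sketch, a pod that starts while the pool has idle entries never waits for the cloud API. The API is only called in the background by `reconcile` or when the cache runs dry.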

Concurrent pod network configuration calls are merged so that resources are applied for in batches, as shown in the following figure.

Image for post

We must also consider many other resource management policies, such as how to select vSwitches for pods to ensure sufficient IP addresses and how to balance the queues and interrupts of the ENIs on each node to minimize contention.

For more information, check the Terway documentation and code.

Exclusive ENI Mode

This mode is implemented in CNI:

  1. The Terway resource manager binds ENIs to nodes where pods are located.
  2. Terway CNI adds ENIs to the pod network namespaces.
  3. Terway CNI configures network data, such as IP addresses and routes for ENIs.
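
Steps 2 and 3 correspond roughly to the following iproute2 commands. The sketch only builds the command lists and does not run them. Terway may well perform these operations through netlink rather than the `ip` tool, and the interface name, namespace name, and addresses are placeholders.

```python
def exclusive_eni_commands(eni, netns, pod_ip_cidr, gateway):
    """Return iproute2 commands that hand a whole ENI to a pod network namespace."""
    return [
        # Move the ENI from the host namespace into the pod namespace.
        ["ip", "link", "set", "dev", eni, "netns", netns],
        # Configure the IP address and bring the interface up inside the pod.
        ["ip", "-n", netns, "addr", "add", pod_ip_cidr, "dev", eni],
        ["ip", "-n", netns, "link", "set", "dev", eni, "up"],
        # Send all pod traffic out through the ENI.
        ["ip", "-n", netns, "route", "add", "default", "via", gateway, "dev", eni],
    ]


commands = exclusive_eni_commands("eth1", "pod-ns", "192.168.1.20/24", "192.168.1.253")
for command in commands:
    print(" ".join(command))
```

Because the device itself moves into the pod namespace, no veth pair or bridge remains on the host side of the path.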

This method has the following features and advantages:

  • The pod network does not pass through the host network stack.
  • Performance matches that of the ECS instance itself, with no performance loss.
  • Pods can use the ENI directly with DPDK and similar techniques to accelerate applications.

Shared ENI Mode

This mode is implemented in CNI:

  1. Based on the number of IP addresses applied for and the existing ENIs, the Terway resource manager determines whether to apply for ENIs or secondary IP addresses.
  2. Terway CNI creates IPVLAN sub-interfaces on the ENI.
  3. Terway CNI puts IPVLAN sub-interfaces into the pod network namespace.
  4. Terway CNI configures the IP address and routing information for the pod network namespace.
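
The four steps can be approximated as below. The allocation rule is a simplification of step 1, the per-ENI limit is a made-up parameter, and IPVLAN `l2` mode is an assumption made for this sketch. The commands are only built, not executed.

```python
def choose_allocation(secondary_ips_in_use, max_ips_per_eni):
    """Step 1: reuse an ENI that still has room, otherwise ask for a new ENI.

    secondary_ips_in_use maps ENI name -> number of secondary IPs already assigned.
    """
    for eni, used in secondary_ips_in_use.items():
        if used < max_ips_per_eni:
            return ("secondary_ip", eni)
    return ("new_eni", None)


def ipvlan_commands(parent_eni, netns, pod_ip_cidr, gateway, ifname="eth0"):
    """Steps 2 to 4: create an IPVLAN sub-interface and configure it in the pod."""
    sub_interface = "ipvl_tmp"  # temporary name in the host namespace (hypothetical)
    return [
        ["ip", "link", "add", "link", parent_eni, "name", sub_interface,
         "type", "ipvlan", "mode", "l2"],
        ["ip", "link", "set", "dev", sub_interface, "netns", netns],
        ["ip", "-n", netns, "link", "set", "dev", sub_interface, "name", ifname],
        ["ip", "-n", netns, "addr", "add", pod_ip_cidr, "dev", ifname],
        ["ip", "-n", netns, "link", "set", "dev", ifname, "up"],
        ["ip", "-n", netns, "route", "add", "default", "via", gateway, "dev", ifname],
    ]


decision = choose_allocation({"eth1": 10, "eth2": 4}, max_ips_per_eni=10)
commands = ipvlan_commands("eth2", "pod-ns", "192.168.1.21/24", "192.168.1.253")
```

Several pods share one parent ENI this way, each with its own sub-interface and secondary IP address.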

This method has the following features and advantages:

  • IPVLAN traffic passes through only part of the host network stack and bypasses iptables and the routing table, so the performance loss is low.
  • One ENI typically supports 10 to 20 secondary IP addresses. You do not need to worry about the deployment density.

Performance comparison:

  • TCP_RR, UDP, PPS, bandwidth, and latency are all better than those of the common Flannel VXLAN overlay network.
  • The exclusive ENI mode can fully utilize the network resources of machines without PPS or bandwidth loss. This mode is suitable for high-performance computing and gaming scenarios.

Enhancing Network Scalability and Performance

By default, Kubernetes Service is implemented by kube-proxy, which uses iptables to configure Service IP addresses and load balancing, as shown in the following figure.

Image for post

This default implementation has two main problems:

  • Long data path: The iptables chain traversed during load balancing is long, so network latency increases significantly. Even in IPVS mode, iptables is still involved.
  • Poor scalability: iptables rules are synchronized in full each time. When the number of Services and pods is large, each synchronization takes about 1 second and data path performance decreases dramatically.

Kubernetes NetworkPolicy controls whether communication between pods is allowed. Currently, mainstream NetworkPolicy implementations are based on iptables, so they share the same iptables scalability issues.

  • Linear iptables matching has poor performance and scaling capability.
  • Linear iptables update is slow.
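
To make the cost of linear matching concrete, the following sketch counts how many rules are compared before a match is found and contrasts that with a hash lookup. It is a toy model with made-up addresses, not a benchmark of iptables.

```python
def linear_match(rules, destination):
    """Scan rules in order, as a rule chain does, and count the comparisons."""
    for comparisons, (service_ip, backend) in enumerate(rules, start=1):
        if service_ip == destination:
            return backend, comparisons
    return None, len(rules)


def hash_match(table, destination):
    """Look up the destination in a hash table in a single step."""
    return table.get(destination)


# 5,000 hypothetical Service IP addresses, one rule each.
rules = [(f"10.96.{i // 256}.{i % 256}", f"backend-{i}") for i in range(5000)]
table = dict(rules)

last_service_ip = rules[-1][0]
backend, comparisons = linear_match(rules, last_service_ip)
```

In this model, a packet for the last Service needs 5,000 comparisons in the linear chain, while the hash lookup cost does not grow with the number of Services.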

eBPF offers a way around these limits:

  • eBPF is a programmable interface provided by recent Linux kernels.
  • The eBPF program is attached to the ENI using tc-ebpf.
  • eBPF greatly reduces the length and complexity of the network path.
Image for post

As shown in the preceding figure, when the tc tool attaches the eBPF program to the ENI of a pod, Service and NetworkPolicy can both be handled at the ENI, and network requests are then sent out directly through the ENI. This greatly reduces network complexity.

  • Each node runs an eBPF agent that watches Service and NetworkPolicy objects and configures the ingress and egress rules on the pod ENI.
  • The egress eBPF program identifies requests to a Kubernetes Service IP address and load balances them across the backend endpoints.
  • The ingress eBPF program checks the source IP address against the NetworkPolicy rules and decides whether to allow the request.
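
The decision logic of the two programs can be modeled in user space as follows. Real eBPF programs are written in restricted C and use kernel maps. The tables, addresses, and the CIDR-based policy here are simplifications invented for this sketch.

```python
import ipaddress
import random

# Hypothetical map: (Service IP, port) -> list of backend (pod IP, port) pairs.
SERVICE_MAP = {
    ("10.96.0.10", 80): [("10.0.1.5", 8080), ("10.0.2.7", 8080)],
}

# Hypothetical policy: only sources in these ranges may reach the pod.
ALLOWED_SOURCES = [ipaddress.ip_network("10.0.1.0/24")]


def egress(dst_ip, dst_port, rng=random):
    """Rewrite Service destinations to a backend and pass other traffic unchanged."""
    backends = SERVICE_MAP.get((dst_ip, dst_port))
    if backends is None:
        return dst_ip, dst_port
    return rng.choice(backends)


def ingress_allowed(src_ip):
    """Allow the packet only if its source matches the policy."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in network for network in ALLOWED_SOURCES)


rewritten = egress("10.96.0.10", 80)
passthrough = egress("8.8.8.8", 53)
```

Both decisions are single map lookups made at the pod's interface, which is why the path stays short however many Services exist.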

Note: We use Cilium as the BPF agent on nodes to configure the BPF rules for pod ENIs. For more information about Terway-related adaptation, please visit this website.

  • After eBPF simplifies the link, the performance is improved significantly by 32% compared to when iptables is used and 62% compared to IPVS mode.
  • Through eBPF programming, there is almost no performance loss when the number of Services is increased to 5000. However, when the number of services is increased to 5000, the performance loss in iptables mode is 61%.
Image for post

A Kubernetes pod must perform many lookups when it resolves a DNS domain name. As shown in the preceding figure, when a pod requests aliyun.com, it works through the search domains in its DNS configuration in sequence:

  • aliyun.com.kube-system.svc.cluster.local -> NXDOMAIN
  • aliyun.com.svc.cluster.local -> NXDOMAIN
  • aliyun.com.cluster.local -> NXDOMAIN
  • aliyun.com -> 1.1.1.1
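
The sequence above follows from the resolver's search list and `ndots` option. The sketch below reproduces the order of attempts, assuming a pod in the kube-system namespace and `ndots` set to 5, which is the usual Kubernetes default.

```python
def candidate_names(name, search_domains, ndots=5):
    """Return the names a stub resolver tries, in order, for a lookup."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search is done.
        return [name.rstrip(".")]
    with_search = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + with_search
    # Too few dots: walk the search list first, then try the name as given.
    return with_search + [name]


SEARCH = ["kube-system.svc.cluster.local", "svc.cluster.local", "cluster.local"]
attempts = candidate_names("aliyun.com", SEARCH)
for attempt in attempts:
    print(attempt)
```

Writing the name with a trailing dot (`aliyun.com.`) skips the search list entirely, at the cost of changing application configuration.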

CoreDNS is centrally deployed on a node. When a pod accesses CoreDNS, the resolution path is long and UDP is used. As a result, the failure rate is high.

One optimization is to change the client-side search into a server-side search.

Image for post

When the pod sends a request to CoreDNS to resolve the domain name:

  • CoreDNS queries pod information based on the source IP address.
  • Then, CoreDNS finds and returns the service that the pod wants to access based on the namespace information of the pod.
  • The number of client requests is reduced from four to one. This reduces CoreDNS requests by 75%, reducing the failure rate.
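
A minimal model of the server-side search is sketched below. It assumes the client sends only its first search-path query and that the server walks the remaining candidates itself. The lookup tables are invented, and a real server would fill them from the API server watch rather than from literals.

```python
CLUSTER_DOMAIN = "cluster.local"


def server_side_resolve(query, source_ip, namespace_by_pod_ip, records):
    """Resolve the client's first search-path query by searching on the server."""
    namespace = namespace_by_pod_ip.get(source_ip)
    if namespace is None:
        return None
    first_search = f".{namespace}.svc.{CLUSTER_DOMAIN}"
    if not query.endswith(first_search):
        # Not a search-path query from this pod: answer it as asked.
        answer = records.get(query)
        return (query, answer) if answer else None
    # Recover the name the application asked for, then try each candidate.
    name = query[: -len(first_search)]
    search_path = [
        f"{namespace}.svc.{CLUSTER_DOMAIN}",
        f"svc.{CLUSTER_DOMAIN}",
        CLUSTER_DOMAIN,
    ]
    for candidate in [f"{name}.{domain}" for domain in search_path] + [name]:
        if candidate in records:
            return candidate, records[candidate]
    return None


PODS = {"10.0.1.5": "kube-system", "10.0.2.6": "default"}   # hypothetical cache
RECORDS = {"aliyun.com": "1.1.1.1", "web.default.svc.cluster.local": "10.96.0.20"}

external = server_side_resolve(
    "aliyun.com.kube-system.svc.cluster.local", "10.0.1.5", PODS, RECORDS
)
internal = server_side_resolve("web.default.svc.cluster.local", "10.0.2.6", PODS, RECORDS)
```

The pod sends one request and receives the final answer, instead of collecting three NXDOMAIN replies first.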

A node-local DNS cache (node-local-dns) shortens the resolution path further:

  • node-local-dns intercepts DNS queries from pods.
  • node-local-dns distributes external domain name traffic to ensure that external domain name requests are no longer sent to the central CoreDNS.
  • The intermediate link uses the more stable TCP resolution method.
  • Nodes cache DNS resolution results, so fewer requests are sent to the central CoreDNS.

DNS resolution can also be made cloud-native with PrivateZone:

  • With cloud-native DNS resolution, pods use the custom DNS capabilities of PrivateZone directly.
  • ExternalDNS watches Service and pod creation and configures the corresponding domain name resolution in PrivateZone.
  • This provides the same DNS resolution performance as native ECS.

Summary

This is how Alibaba Cloud designs and builds high-performance, cloud-native pod networks. With cloud-native development, more types of application loads will run on Kubernetes and more cloud services will be integrated into pod scenarios. We believe that more functions and application scenarios will be incubated in high-performance, cloud-native pod networks in the future.

Those who are interested in this topic are welcome to join us.

Q & A

Q1: Is the veth pair used to connect the pod network namespace and the host when a pod network namespace is created?

A1:

  • Shared ENI mode: In kernel 3.10, the veth pair is used to connect namespaces to ensure compatibility. In 4.x kernel versions, such as the aliyunlinux2 kernel 4.19 used on Alibaba Cloud, IPVLAN is used to connect namespaces.
  • Exclusive ENI mode: ENIs are moved to the namespaces of pods, without needing to connect the host and the pod namespace.

Q2: How are the security audits performed when a pod’s IP address is not fixed?

A2:

  • The IP address of a pod remains unchanged throughout its lifecycle. You can find which pod an IP address was allocated to in the Kubernetes events.
  • In Terway NetworkPolicy implementation, pods are identified by labels. When pods are rebuilt, the IP addresses of pods created with the labels are dynamically updated.
  • The IP addresses that Terway configures for pods are relatively fixed. For example, when a Statefulset application is updated on a node, Terway reserves the pod IP address for a period so that it can be quickly used when the pod is restarted. The IP address remains unchanged during the update process.

Q3: Does IPVLAN have high requirements for the kernel?

A3: Yes. On Alibaba Cloud, we can use aliyunlinux2 kernel 4.19. On earlier kernel versions, Terway also supports the veth pair + policy routing method to share secondary IP addresses on the ENI. However, the performance may be lower.

Q4: Will the pod startup speed be affected if eBPF is started in a pod? How long does it take to deploy eBPF in a pod?

A4: The eBPF program code is not too large. Currently, it increases the overall deployment time by several hundred milliseconds.

Q5: Is IPv6 supported? What are its implementation problems? Are there kernel or kube-proxy code issues?

A5: Currently, IPv6 addresses can be exposed through LoadBalancer. However, IPv6 addresses are converted to IPv4 addresses in LoadBalancer. Currently, pods do not support IPv6 addresses. kube-proxy supports IPv6 addresses starting from Kubernetes 1.16. We are actively tracking this issue and plan to implement native pod IPv4/IPv6 dual stack together with Alibaba Cloud IaaS this year.

Q6: The source IP address is used to obtain the accessed service during each CoreDNS resolution request. Is the Kubernetes API called to obtain the accessed service? Will this increase pressure on the API?

A6: No. The preceding figure shows the structure. CoreDNS AutoPath watches pod and Service changes from the API server using the list-and-watch mechanism and updates a local cache, so it does not call the API on each resolution request.

Q7: Are Kubernetes Service requests sent to a group of pods in polling mode? Are the probabilities of requests to each pod the same?

A7: Yes. The probabilities are the same. It is similar to the round robin algorithm used in load balancing.
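
One way to see why the probabilities come out equal in iptables mode: kube-proxy selects a backend with a chain of random-match rules, where rule i of n matches with probability 1/(n - i) and the last rule always matches. The following sketch checks that arithmetic with exact fractions.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Per-rule match probability for n backends: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_probabilities(n):
    """Probability that each backend is actually chosen, given rule order."""
    remaining = Fraction(1)  # probability that no earlier rule has matched
    result = []
    for probability in rule_probabilities(n):
        result.append(remaining * probability)
        remaining *= 1 - probability
    return result


per_backend = effective_probabilities(4)
```

For four backends the rules match with probability 1/4, 1/3, 1/2, and 1, and every backend ends up with an overall chance of 1/4.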

Q8: IPVLAN and eBPF seem to be supported only by later kernel versions. Do they have requirements for the host kernel?

A8: Yes. We can use aliyunlinux2 kernel 4.19 on Alibaba Cloud. In earlier kernel versions, Terway also supports the veth pair + policy routing method to share secondary IP addresses on the ENI. However, the performance may be low.

Q9: How does Cilium manage or assign IP addresses? Do other CNI plugins manage the IP address pool?

A9: Cilium has two ways to assign IP addresses. host-local: Each node is given an address segment, and addresses are assigned sequentially from it. CRD-backend: An IPAM plugin assigns the IP addresses. In Terway, Cilium only handles NetworkPolicy and Service hijacking and load balancing. It does not assign or configure IP addresses.

Q10: Does Cilium inject BPF into the veth of the host instead of veth of the pod? Have you made any changes?

A10: Yes. Cilium modifies the peer veth of the pod. After testing, we found that the performance of IPVLAN is better than veth. Terway uses IPVLAN for a high-performance network without the peer veth. For more information about our modifications for adaptation, please visit this website. In addition, Terway uses Cilium only for NetworkPolicy and Service hijacking and load balancing.

Q11: How does a pod access the cluster IP address of a service after the Terway plugin is used?

A11: The eBPF program attached to the pod ENI load balances requests for the Service IP address to the backend pods.

Q12: Can you talk about Alibaba Cloud’s plans for service mesh?

A12: Alibaba Cloud provides the Alibaba Cloud Service Mesh (ASM) product. Subsequent developments will focus on ease-of-use, performance, and global cross-region integrated cloud, edge, and terminal connections.

Q13: Will the ARP cache be affected if a pod IP address is reused after the node’s network is connected? Will IP address conflicts exist if a node-level fault occurs?

A13: First, the cloud network does not have the ARP problem. Generally, layer-3 forwarding is adopted at the IaaS layer, so the ARP problem does not exist even if IPVLAN is used locally. If Macvlan is used, the ARP cache is affected, and macnat can generally be used to handle this (it can be implemented with either ebtables or eBPF). Whether an IP conflict occurs depends on the IP management policy. In the current Terway solution, IPAM directly calls IaaS IPAM, which does not have this problem. If you build the pod network offline yourself, consider a DHCP strategy or static IP address assignment to avoid this problem.

Q14: After the link is simplified using eBPF, the performance is improved significantly (by 31%) compared to iptables and (by 62%) compared to IPVS. Why is the performance improvement relative to IPVS more significant? Is the gain mainly from optimizing the linear matching and updates of iptables?

A14: The comparison here is based on one service and concerns the impact of link simplification. In iptables mode, NAT tables are used for DNAT and only the forward process is involved. In IPVS mode, input and output are involved. Therefore, iptables may be better in the case of a single service. However, the linear matching of iptables causes significant performance deterioration when many services are involved. For example, IPVS is better when there are 5,000 services.

Q15: If I do not use this network solution and think that large-scale service use will affect performance, are there any other good solutions?

A15: The performance loss in kube-proxy IPVS mode is minimal in large-scale service scenarios. However, a very long link is introduced, so the latency will increase a little.

About the Author

Xiheng is an Alibaba Cloud Technical Expert and maintainer of the Alibaba Cloud open-source CNI plugin Terway project. He is responsible for Container Service for Kubernetes (ACK) network design and R&D.
