How Does Alibaba Ensure the Performance of System Components in a 10,000-node Kubernetes Cluster?

  • 200,000 pods
  • 1,000,000 objects
  1. etcd encountered read and write latency issues, which lead to a denial of service, with a large number of Kubernetes-stored objects unable to receive support due to space limitations.
  2. The API server encountered extremely high latency when querying pods and nodes, and the back-end etcd sometimes also ran out of memory due to concurrent query requests.
  3. The controller encountered high processing latency because it couldn’t detect the latest changes from the API server. For abnormal restarts, service recovery would often take several minutes.
  4. The scheduler encountered high latency and low throughput, and was unable to meet the routine O&M requirements of Alibaba to support the extreme performance nessary for the shopping festivals run on Alibaba’s various e-commerce platforms.

Improvements to etcd

To solve these problems, the team of engineers at Alibaba Cloud and Ant Finanical made comprehensive improvements to Alibaba Cloud Container Platform so to improve the performance of Kubernetes in these large-scale applications.

  • In the first version, the total storage capacity of etcd was improved by dumping the data stored in etcd to a tair cluster. However, this had the disadvantage that the security of the Kubernetes cluster could be hampered by the complex O&M requirements of the tair cluster, which uses a data consistency model that is not based on a Raft replication group.
  • In the second version, different types of objects on the API server were stored in different etcd clusters. Then, different data directories were created within etcd, and data in different directories was routed to different etcd back ends. This helped to reduce the total data volume stored in a single etcd cluster and improve overall extensibility.
  • In the third version, the team studied the internal implementation principle of etcd and found that the extensibility of etcd was mainly determined by the page allocation algorithm of the underlying bbolt db. As etcd stores increased the amount of data, bbolt db experienced worse performance in linearly searching for the storage page with a run length.

Improvements to the API Server

In this following section, we will look at some of the challenges that the engineering team faced when it came to optimizing the API Server and look at the solutions they proposed. For this section, we will need to go into quite a few technical details, as the engineering team addressed several issues.

Efficient Node Heartbeats

The scalability of Kubernetes clusters depends on the effective processing of node heartbeats. In a typical production environment, the kubelet reports a heartbeat every 10s. The content connected with each heartbeat request reaches 15 KB, including dozens of images on the node and a certain amount of volume information. However, there are two problems with all of this:

  1. The heartbeat request triggers the update of node objects in etcd, which may generate nearly 1 GB of transaction logs per minute in a Kubernetes cluster with 10,000 nodes. The change history is recorded in etcd.
  2. In a node-packed Kubernetes cluster that involves high CPU usage of the API server and huge overhead of serialization and deserialization, the CPU overhead for processing heartbeat requests exceeds 80% of the CPU time usage of the API server.
  1. It updates Lease objects every 10s to indicate the survival status of nodes. The node controller determines whether nodes are alive based on the Lease object status.
  2. In consideration of compatibility, the cluster updates node objects every 60s so that EvictionController works based on the original logic.

Load Balancing for the API Server

In a production cluster, multiple nodes are typically deployed to form high-availability (HA) Kubernetes clusters for performance and availability considerations. During the runtime of an high-availability cluster, the loads among multiple API servers may be unbalanced, especially in the occasion that the cluster is upgraded or nodes restart due to faults. This imbalance creates a great burden for the cluster’s stability. In extreme situations, the burden on the API servers that was originally distributed through high-availability is concentrated again on one node. This slows down the system response and makes the node unresponsive, causing an avalanche.

  1. Add a load balancer on the API server side and connect all Kubernetes clusters to the load balancer. This method is applied to Kubernetes clusters delivered by cloud vendors.
  2. Add a load balancer on the kubelet side and enable the load balancer to select API servers.
  1. They configured the API servers to consider clients as untrusted to prevent overloading due to excessive requests. Then, in the case an API server exceeds a load threshold, it sends 429 error code (to indicate too many requests received) to notify the clients of backoff. When the API server exceeds a higher load threshold, it closes client connections to deny requests.
  2. They configured clients to periodically reconnect to other API servers when frequently receiving the 429 error code within a specific period of time.
  3. At the O&M level, the team upgraded the API servers by setting maxSurge to 3 to prevent performance jitter during upgrades.

List-Watch and Cacher

List-Watch is essential for server-client communication in Kubernetes. API servers watch the data changes and updates of objects in etcd through a reflector and store the changes and updates in the memory. Controllers and Kubernetes clients subscribe to data changes through a mechanism similar to List-Watch.

  1. Pod storage
  2. Node storage
  3. Configmap storage

Cacher and Indexing

In addition to List-Watch, clients can access API servers through a direct query, as shown in the above figure. API servers take care of the query requests of clients by reading data from etcd to ensure that consistent data is retrieved by the same client from multiple API servers. However, with this system, the following performance problems occur:

  1. Indexing is not supported. The pod for node query must first obtain all the pods in the cluster, which results in a huge overhead.
  2. If the amount of data queried by a single request is large, too much memory is consumed due to the request-response model adopted by etcd. The amount of requested data is limited by the queries that API servers make to etcd. Queries of massive data are completed through pagination, which results in multiple round trips that greatly reduce performance.
  3. To ensure consistency, API servers use Quorum read to query etcd. The query overhead is measured at the cluster level and cannot be expanded.
  1. The client queries the API servers at time t0.
  2. The API servers request the current data version rv@t0 from etcd.
  3. The API servers update their request progress and wait until the data version of the reflector reaches rv@t0.
  4. A response is sent to the client through the cache.

Context-Aware Problems and Optimizations

To understand what are content-aware problems, first consider the following situation. The API Server needs to receive and complete requests to have access to external services. This could be, for example, to access to the etcd for persistent data, or the Webhook Server for extended Admission or Auth, or even the API Server itself.

Request Flood Prevention

The API Server is also vulnerable to receiving too many request. This is because, even though the API Server comes with the max-inflight filter, which limits maximum concurrent read/wwrites, there really aren’t any other filters for restricting the number of requests to the Server, making it vulnerable to crashes.

  • Issues during API Server restarts or updates
  • The Client has a bug that causes an abnormal number of requests

Controller Failover

In a 10,000-node production cluster, one controller stores nearly one million objects. It consumes substantial overhead to retrieve these objects from the API server and deserialize them, and it may take several minutes to restart the controller for recovery. This is unacceptable for enterprises as large as Alibaba. To mitigate the impact on system availability during component upgrade, the team of engineers at Alibaba developed a solution to minimize the duration of system interruption due to controller upgrade. The following figure shows how this solution works.

  1. Controller informers are pre-started to pre-load the data required by the controller.
  2. When the active controller is upgraded, Leader Lease is actively released to enable the standby controller to immediately take over services from the active controller.

Customized Schedulers

Due to historical reasons, Alibaba schedulers adopt a self-developed architecture. Enhancements made to schedulers are not described in detail here due to length constraints. Rather, only two basic ideas — both nonetheless important ideas — are described here, which are also shown in the figure below.

  1. Equivalence classes: A typical capacity expansion request is intended to scale up multiple containers at a time. Therefore, the team divided the requests in a pending queue into equivalence classes to batch-process these requests. This greatly reduced the number of predicates and priorities.
  2. Relaxed randomization: When a cluster provides many candidate nodes for serving one scheduling request, you only need to select enough nodes to process the request, without having to evaluate all nodes in the cluster. This method can improve the scheduling performance at the expense of solving accuracy.


Through a series of enhancements and optimizations, the engineering teams at Alibaba and Ant Financial were able to successfully apply Kubernetes to the production environments of Alibaba and reach the ultra-large scale of 10,000 nodes in a single cluster. To sum things up, specific improvements that were made include the following:

  1. The storage capacity of etcd was improved by separating indexes and data and sharding data. The performance of etcd in storing massive data was greatly improved by optimizing the block allocation algorithm of the bbolt db storage engine at the bottom layer of etcd. Last, the system architecture was simplified by building large-scale Kubernetes clusters based on a single etcd cluster.
  2. The List performance bottleneck was removed to enable the stability of operation of 10,000-node Kubernetes clusters through a series of enhancements, such as Kubernetes lightweight heartbeats, improved load balancing among API servers in high-availability (HA) clusters, List-Watch bookmark, indexing, and caching.
  3. Hot standby mode greatly reduced the service interruption time during active/standby failover on controllers and schedulers. This also improved cluster availability.
  4. The performance of Alibaba-developed schedulers was effectively optimized through equivalence classes and relaxed randomization.

Original Source



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website: