How Does Alibaba Ensure the Performance of System Components in a 10,000-node Kubernetes Cluster?

At Alibaba's target scale of 10,000 nodes, a single Kubernetes cluster must support:

  • 200,000 pods
  • 1,000,000 objects

Without optimization, a cluster of this size runs into the following problems:

  1. etcd encountered read and write latency issues, which led to a denial of service; moreover, due to space limitations, etcd could not support the large number of objects stored by Kubernetes.
  2. The API server encountered extremely high latency when querying pods and nodes, and the back-end etcd sometimes also ran out of memory due to concurrent query requests.
  3. The controller encountered high processing latency because it could not promptly detect the latest changes from the API server. After an abnormal restart, service recovery could take several minutes.
  4. The scheduler encountered high latency and low throughput, and could meet neither Alibaba's routine O&M requirements nor the extreme performance necessary for the shopping festivals run on Alibaba's various e-commerce platforms.

Improvements to etcd

  • In the first version, etcd's total storage capacity was increased by dumping the data stored in etcd to a Tair cluster. The disadvantage was that the security of the Kubernetes cluster could be compromised by the complex O&M requirements of the Tair cluster, whose data consistency model is not based on a Raft replication group.
  • In the second version, different types of objects on the API server were stored in different etcd clusters. Then, different data directories were created within etcd, and data in different directories was routed to different etcd back ends. This helped to reduce the total data volume stored in a single etcd cluster and improve overall extensibility.
  • In the third version, the team studied the internal implementation of etcd and found that its extensibility was mainly determined by the page allocation algorithm of the underlying bbolt db: as the amount of data stored in etcd grew, bbolt db performed increasingly poorly when linearly scanning its freelist for a contiguous run of storage pages.
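The difference can be illustrated with a toy model (a hypothetical Python sketch, not bbolt's actual code): the old allocator linearly scans the sorted free-page list for a contiguous run of n pages, which is O(total free pages), whereas indexing runs by their length makes the lookup nearly constant-time.

```python
# Toy model of bbolt-style free-page allocation (illustrative only).
# Assumes the freelist is a sorted list of free page IDs.

def alloc_linear(freelist, n):
    """Old approach: linear scan for a run of n contiguous page IDs."""
    run_start, run_len, prev = None, 0, None
    for pid in freelist:
        if prev is not None and pid == prev + 1:
            run_len += 1
        else:
            run_start, run_len = pid, 1
        if run_len == n:
            pages = list(range(run_start, run_start + n))
            for p in pages:
                freelist.remove(p)
            return pages
        prev = pid
    return None  # no contiguous run of length n found

def build_size_index(freelist):
    """Improved idea: index contiguous runs by their length,
    so a run of a given size can be found in O(1)."""
    runs, start, prev = {}, None, None  # run length -> starting page IDs
    for pid in freelist + [None]:       # None flushes the final run
        if not (prev is not None and pid == prev + 1):
            if start is not None:
                runs.setdefault(prev - start + 1, []).append(start)
            start = pid
        prev = pid
    return runs

free = [3, 4, 5, 9, 10, 20]
assert alloc_linear(list(free), 2) == [3, 4]
assert build_size_index(free) == {3: [3], 2: [9], 1: [20]}
```

The size-indexed structure mirrors the general idea of replacing a linear freelist search with a lookup keyed by run length; the real bbolt data structures differ in detail.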

Improvements to the API Server

Efficient Node Heartbeats

  1. Each heartbeat request updates the node object in etcd, and because etcd records the full change history, heartbeats can generate nearly 1 GB of transaction logs per minute in a 10,000-node Kubernetes cluster.
  2. The API server suffers high CPU usage from serializing and deserializing node objects; in a node-packed cluster, processing heartbeat requests can account for more than 80% of the API server's CPU time.
To fix this, the node heartbeat was split into two parts:

  1. The kubelet updates a Lease object every 10 s to indicate node liveness. The node controller determines whether a node is alive based on the Lease object's status.
  2. For compatibility, the kubelet still updates the full node object every 60 s, so that the EvictionController continues to work with its original logic.
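The effect of the split can be estimated with rough arithmetic (the payload sizes below are illustrative assumptions, not measured values): the frequent 10 s signal touches only a tiny Lease object, while the large node object is written six times less often.

```python
# Illustrative numbers only: object sizes are assumptions, not measurements.
LEASE_INTERVAL = 10    # seconds between Lease renewals (small object)
STATUS_INTERVAL = 60   # seconds between full node-status updates
LEASE_BYTES = 300          # a Lease carries little more than a timestamp
NODE_STATUS_BYTES = 15_000 # node status carries conditions, images, capacity

def split_scheme_bytes_per_minute(n_nodes):
    """etcd write volume per minute with the split heartbeat."""
    lease_writes = 60 // LEASE_INTERVAL    # 6 small writes per node
    status_writes = 60 // STATUS_INTERVAL  # 1 large write per node
    return n_nodes * (lease_writes * LEASE_BYTES +
                      status_writes * NODE_STATUS_BYTES)

def old_scheme_bytes_per_minute(n_nodes):
    """Old behavior: the full node object was written at every heartbeat."""
    return n_nodes * (60 // LEASE_INTERVAL) * NODE_STATUS_BYTES

old = old_scheme_bytes_per_minute(10_000)
new = split_scheme_bytes_per_minute(10_000)
assert new < old / 5  # several-fold reduction in etcd write volume
```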

Load Balancing for the API Server

  1. Add a load balancer in front of the API servers and connect all kubelets to the load balancer. This method is applied in Kubernetes clusters delivered by cloud vendors.
  2. Add a load balancer on the kubelet side and enable the load balancer to select API servers.
On top of load balancing, the team made three further optimizations:

  1. They configured the API servers to treat clients as untrusted, protecting the servers from overload due to excessive requests: when an API server exceeds a load threshold, it returns an HTTP 429 (Too Many Requests) error to tell clients to back off; when it exceeds a higher threshold, it closes client connections to reject further requests.
  2. They configured clients to periodically reconnect to other API servers when frequently receiving the 429 error code within a specific period of time.
  3. At the O&M level, the team set maxSurge to 3 when rolling out API server upgrades to prevent performance jitter during the upgrade.
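Measures 1 and 2 can be sketched from the client's perspective (a hypothetical Python model; real kubelets implement this behavior inside client-go rather than in application code):

```python
import random

class ApiServerClient:
    """Toy client: back off on HTTP 429, and reconnect to a different
    API server after repeatedly receiving 429 within a window."""

    def __init__(self, endpoints, max_429_in_window=3):
        self.endpoints = list(endpoints)
        self.current = self.endpoints[0]
        self.recent_429 = 0
        self.max_429 = max_429_in_window

    def on_response(self, status_code):
        if status_code == 429:
            self.recent_429 += 1
            if self.recent_429 >= self.max_429:
                # too many 429s: spread load by moving to another server
                others = [e for e in self.endpoints if e != self.current]
                self.current = random.choice(others)
                self.recent_429 = 0
                return "reconnected"
            return "backoff"
        self.recent_429 = 0  # healthy response resets the window
        return "ok"

c = ApiServerClient(["apiserver-1", "apiserver-2", "apiserver-3"])
assert c.on_response(200) == "ok"
assert c.on_response(429) == "backoff"
assert c.on_response(429) == "backoff"
assert c.on_response(429) == "reconnected"
assert c.current in ("apiserver-2", "apiserver-3")
```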

List-Watch and Cacher

  1. Pod storage
  2. Node storage
  3. Configmap storage

Cacher and Indexing

Serving List requests directly from etcd has three problems:

  1. Indexing is not supported: a query for the pods on a given node must first retrieve all pods in the cluster, which incurs a huge overhead.
  2. When a single request queries a large amount of data, the request-response model adopted by etcd consumes too much memory. API servers therefore cap the amount of data per etcd query, and queries of massive data are completed through pagination; the resulting multiple round trips greatly reduce performance.
  3. To ensure consistency, API servers use quorum reads to query etcd. The query overhead is borne at the cluster level and cannot be scaled out.
To serve such queries from the API server cache while preserving consistency, the flow is as follows:

  1. The client queries the API servers at time t0.
  2. The API servers request the current data version rv@t0 from etcd.
  3. The API servers update their request progress and wait until the data version of the reflector reaches rv@t0.
  4. A response is sent to the client through the cache.
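The four steps above can be sketched as a small consistency protocol (a hypothetical Python model of the API server's watch cache, not Kubernetes source code): the cache only answers once its watch stream has caught up to the etcd revision observed at t0, so the client never reads stale data.

```python
import threading

class Cacher:
    """Toy watch cache: serve a List from the cache only after the
    cache has caught up to etcd's current revision (rv@t0)."""

    def __init__(self):
        self.revision = 0
        self.data = {}
        self.cond = threading.Condition()

    def apply_watch_event(self, rv, key, value):
        # the reflector delivers ordered watch events from etcd
        with self.cond:
            self.data[key] = value
            self.revision = rv
            self.cond.notify_all()

    def consistent_list(self, rv_t0, timeout=5.0):
        # step 3: wait until the cached data version reaches rv@t0
        with self.cond:
            caught_up = self.cond.wait_for(
                lambda: self.revision >= rv_t0, timeout=timeout)
            if not caught_up:
                raise TimeoutError("cache did not catch up to rv@t0")
            # step 4: respond from the cache
            return dict(self.data)

cache = Cacher()
cache.apply_watch_event(1, "pod-a", "v1")
cache.apply_watch_event(2, "pod-a", "v2")
rv_t0 = 2  # step 2: current revision fetched from etcd at time t0
assert cache.consistent_list(rv_t0) == {"pod-a": "v2"}
```

Because only the small "get current revision" request hits etcd, the expensive List itself is absorbed by the API server's cache.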

Context-Aware Problems and Optimizations

Request Flood Prevention

  • Issues during API Server restarts or updates
  • A client bug causes an abnormal number of requests

Controller Failover

  1. Controller informers are pre-started to pre-load the data required by the controller.
  2. When the active controller is upgraded, Leader Lease is actively released to enable the standby controller to immediately take over services from the active controller.
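The handoff can be modeled in a few lines (a toy Python sketch; real controllers use client-go leader election over Lease objects). The key point is that an active release lets the warm standby take over immediately, instead of waiting for the lease to expire.

```python
class LeaderElection:
    """Toy model of active/standby handoff for a controller."""

    def __init__(self, lease_duration=15):
        self.holder = None
        self.lease_duration = lease_duration  # worst-case wait on crash

    def acquire(self, candidate):
        if self.holder is None:
            self.holder = candidate
            return True
        return False  # lease held by someone else; keep retrying

    def release(self, holder):
        # graceful upgrade: actively release so the standby need not
        # wait out the full lease_duration
        if self.holder == holder:
            self.holder = None

election = LeaderElection()
assert election.acquire("controller-active")
# the standby keeps its informers pre-started, so its cache stays warm
assert not election.acquire("controller-standby")
election.release("controller-active")        # active upgrade begins
assert election.acquire("controller-standby")  # immediate takeover
```

Without the active release, a crashed leader would force the standby to wait up to `lease_duration` before acquiring; the pre-loaded informers then remove the additional minutes of cache resync.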

Customized Schedulers

  1. Equivalence classes: A typical capacity-expansion request scales up multiple identical containers at a time. Therefore, the team divided the requests in the pending queue into equivalence classes so that they could be batch-processed, which greatly reduced the number of Predicate and Priority evaluations.
  2. Relaxed randomization: When a cluster offers many candidate nodes for a single scheduling request, the scheduler only needs to select enough feasible nodes to process the request, rather than evaluating every node in the cluster. This improves scheduling performance at the expense of scheduling accuracy.
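Both optimizations can be combined in a compact sketch (hypothetical Python; the names `schedule_batch`, `spec_key`, and `feasible` are illustrative, not Alibaba's scheduler API): pods with identical requirements share one feasibility pass, and scoring considers only a random sample of feasible nodes.

```python
import random

def schedule_batch(pods, nodes, feasible, sample_size=50):
    """Toy scheduler combining equivalence classes and relaxed
    randomization. `feasible(spec_key, node)` stands in for the
    predicate checks."""
    # equivalence classes: group pods with identical requirements
    classes = {}
    for pod in pods:
        classes.setdefault(pod["spec_key"], []).append(pod)

    placements = {}
    for spec_key, members in classes.items():
        # evaluate predicates once per class, not once per pod
        candidates = [n for n in nodes if feasible(spec_key, n)]
        # relaxed randomization: score only a sample of candidates
        sample = random.sample(candidates,
                               min(sample_size, len(candidates)))
        for pod in members:
            placements[pod["name"]] = random.choice(sample)
    return placements

nodes = [f"node-{i}" for i in range(1000)]
pods = [{"name": f"web-{i}", "spec_key": "web-v1"} for i in range(3)]
result = schedule_batch(pods, nodes, feasible=lambda spec, node: True)
assert len(result) == 3
assert all(n in nodes for n in result.values())
```

With 1,000 nodes and `sample_size=50`, the scoring work per request drops by roughly 20x, at the cost of possibly missing the single best-scoring node.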


  1. The storage capacity of etcd was improved by separating indexes and data and by sharding data. The performance of etcd in storing massive data was greatly improved by optimizing the block allocation algorithm of the bbolt db storage engine at the bottom layer of etcd. Finally, the system architecture was simplified by building large-scale Kubernetes clusters on a single etcd cluster.
  2. The List performance bottleneck was removed, enabling stable operation of 10,000-node Kubernetes clusters through a series of enhancements, such as lightweight Kubernetes heartbeats, improved load balancing among API servers in high-availability (HA) clusters, List-Watch bookmarks, indexing, and caching.
  3. Hot standby mode greatly reduced the service interruption time during active/standby failover on controllers and schedulers. This also improved cluster availability.
  4. The performance of Alibaba-developed schedulers was effectively optimized through equivalence classes and relaxed randomization.

Original Source



