Operating Large Scale Kubernetes Clusters on Alibaba X-Dragon Servers

A Quick Run-Down of the Architecture Used

For 2019’s Double 11 shopping event, Alibaba’s systems relied on five resource units distributed across three geographic locations. Of the five resources, three were running completely on public cloud, specifically Alibaba Cloud, and the other two were running in a hybrid cloud environment. Among these, one of the most important were the X-Dragon bare metal servers, whose current generation already withstood the major traffic spikes seen during the midyear June 18 and autumn September 9 promotions on Alibaba’s various e-commerce platforms, proving to Alibaba engineers they could easily take on Double 11, too. And, with that, all three resources running on the cloud for the 2019 Double 11 Shopping Festival were running specifically on the infrastructure provided by X-Dragon bare metal servers. This included tens of thousands of Kubernetes clusters.

Why We Ended up Using X-Dragon for Double 11

The virtualization technology, X-Dragon, behind Alibaba Cloud Elastic Compute Service (ECS) is now in its third generation. The first two generations were called Xen and KVM. X-Dragon was specifically developed by a dedicated team of engineers for Alibaba Cloud’s sixth generation of ECS instances. The X-Dragon architecture provides the following four advantages:

  1. The storage and network virtual VMM and ECS control are decoupled from the related computing virtualization for better optimizations.
  2. Computing virtualization has further evolved to the Near Metal Hypervisor.
  3. The acceleration of storage and network VMM is realized with dedicated chips.
  4. Alibaba Cloud Elastic Compute Service (ECS) virtual machines and ECS Bare Metal instances are used as the main computing infrastructure.

Combining X-Dragon with Containers and Kubernetes

In the all in the cloud era, enterprise IT architectures are being reshaped and transformed, with cloud native becoming the quickest way to realize the value of cloud computing. For last year’s Double 11 Shopping Festival, Alibaba moved its core system to a cloud-native platform, which was built on X-Dragon bare metal servers, lightweight cloud-native containers, and a Kubernetes-compatible orchestration platform know as the Alibaba Serverless Infrastructure (ASI). Kubernetes pods are well integrated with X-Dragon bare metal servers, as they can deliver business performance while X-Dragon bare metal instances provide the underlying power.

  • ASI pods run on X-Dragon bare metal nodes and offload network and storage virtualization to standalone hardware MOC cards, which use FPGA chip acceleration technology to achieve superior performance to physical servers and ECS instances. MOCs have their own operating system and kernel, allowing them to allocate CPU cores to AVS (networking) and TDC (storage).
  • An ASI pod consists of a Main container that hosts the service, a star-agent sidecar container for O&M tasks, and other supplementary containers, such as the local cache container of an application. Each pod uses the Pause container to share the same network namespace, UTS namespace, and PID namespace. However, ASI disables PID namespace sharing.
  • The Main container and the O&M container of a Pod share the same data volume, which is declared as a cloud disk through PVC and mounted to the corresponding cloud disk mount point. In the storage architecture of the ASI, each pod has an independent cloud disk space, which supports read/write isolation and limits the disk size.
  • An ASI pod uses the Pause container to directly connect to the ENI on the MOC.
  • No matter how many containers are in an ASI pod, the pod only uses its assigned resources, such as 16 CPU cores, 60 GB of RAM and 60 GB of storage space.

How the Large-Scale X-Dragon O&M Works

Tens of thousands of X-Dragon clusters were used to handle core transactions during the 2019 Double 11 Shopping Festival, which represented a significant challenge to our O&M personnel. The tasks they faced included instance specification selection for all cloud services, large-scale cluster elastic scaling, node resource assignment and control, as well as core metric collection and analysis. Besides these, it also involved such things as infrastructure monitoring, downtime analysis, node label management, node reboot, lock, and release, auto-repair, as well as faulty node rotation, kernel patching and upgrades, and large-scale inspections.

Instance Specification Selection

First up, we planned different instance specifications for different services, such as the ingress layer, core service system, middleware, database, and cache. Different services have different characteristics and require different hardware setups. Some require high-performance computing power, others require high packet throughput networking, yet others require high-performance disk read/write capabilities. These needed to be designed and planned beforehand so that insufficient resources would not impact service performance and stability. The specification of an instance includes its vCPU, memory, ENI quantity, cloud disk quantity, system disk size, data disk size, and packet throughput (PPS).

Resource Elasticity

Alibaba’s annual Double 11 shopping promotion brings about some massive traffic peaks. To cope with the traffic, every year we need massive volumes of computing resources. However, it would be wasteful to retain these resources for daily use. Therefore, we created a cluster group for daily operations and another for promotional events. Several weeks prior to Double 11, we started expanding the cluster group responsible for promotional events through our elastic scaling capability by applying for a large number of X-Dragon instances and scaled out the Kubernetes pods to handle the upcoming transactions. Immediately after the Double 11 Shopping Festival, the pod containers in the promotional events cluster were removed in batches, and the X-Dragon instances were removed. Only the cluster group for daily operations was retained. This method allowed us to be able to significantly reduce the cost of promotional events. After the migration to the cloud, the time from applying for an X-Dragon instance to its creation was reduced from hours or days to mere minutes. Now, we can apply for thousands of X-Dragon instances, including their computing, network, and storage resources, in five minutes and see them created and imported into Kubernetes clusters in 10 minutes. This leap in cluster creation efficiency will serve as the foundation for future elastic resource scaling.

Core Metrics

There are three core metrics used to measure the overall health of a large scale X-Dragon cluster: uptime, downtime, and rate of successful orchestration.

Label Management

As the cluster size increases, management becomes more difficult. For example, how can we filter out all the instances in the cn-shanghai region with the type ecs.ebmc6-inc.26xlarge? Well, to facilitate such operations, we use predefined labels to manage resources. In the Kubernetes architecture, you can define labels to manage instances. Each label is a Key-Value pair. You can use the standard Kubernetes API to assign labels to X-Dragon instances. For example, we used sigma.ali/machine-model: ecs.ebmc6-inc.26xlarge to represent the instance type and sigma.ali/ecs-region-id: cn-shanghai to represent the region. With this label management system, we can quickly locate the instances we want from tens of thousands of X-Dragon instances and perform routine O&M operations such as batch canary releases, batch restarts, and batch releases.

Downtime Analysis

Downtime is inevitable for large-scale clusters. That’s just a fact. It’s how you collect downtime statistics and analyze them that makes a difference. Many things can cause downtime, including hardware faults and kernel bugs. Once downtime occurs, services are interrupted and some stateful applications are also affected. We use SSH and port ping inspection to monitor the downtime of the resource pool and collect historical downtime trends. Alarms are triggered if there is a sudden uptick in downtime. At the same time, we conduct association analysis on the faulty instances. For example, we look at their data center, unit, and group to find out if the downtime was caused by a specific data center. Or, maybe it is related to certain hardware specifications, such as the model or CPU configuration. Downtime can even be related to software problems, such as the operating system or kernel version.

Node Recovery

Running large-scale X-Dragon clusters means you are going to encounter hardware and software faults. The more complex the technology stack, the more complicated the faults will be. It is unrealistic to simply rely on manual work to solve these issues. Automation is needed. Our 1–5–10 node recovery technology is able to discover there is a problem in 1 minute, locate the sources of the problem in 5 minutes, and recover the faulty nodes in 10 minutes. Some of the typical causes of downtime for X-Dragon instances include host machine downtime, high host load, full disk, too many open files, and unavailable core services such as Kubelet, Pouch, or Star-Agent. These problems can usually be solved by recovery operations such as machine reboot, container eviction, software reboot, and disk auto-clean. Over 80% of issues can be solved by using machine reboot and evicting the container to other nodes. In addition, we can automatically repair the NC-side system or hardware faults by monitoring two system events, X-Dragon Reboot and Redeploy.

Looking Ahead towards This Year’s Double 11

For this year’s Double 11 shopping festival, all of Alibaba infrastructure will be in a cloud native architecture that is going to be based on Kubernetes, and the next-generation hybrid deployment architecture based on runV security containers will be implemented on a large scale. As such, we will take lightweight container architecture to the next level.

Original Source:



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com