Flexible and Efficient Cloud-native Cluster Management Experience with Kubernetes
By Huaiyou, Linshi
A single Kubernetes cluster provides users with namespace-level isolation. In theory, such a cluster supports no more than 5,000 nodes and 150,000 pods. Running multiple Kubernetes clusters addresses the resource isolation and fault isolation problems of a single cluster, and the total number of nodes and pods is no longer bounded by the single-cluster limit. However, cluster management becomes more complex. In Apsara Stack scenarios in particular, Kubernetes engineers cannot reach customer environments as quickly as they can in public cloud scenarios, which further increases operation and maintenance (O&M) costs. Managing multiple Kubernetes clusters in an automated, efficient, and low-cost manner has therefore become a common challenge in the industry.
Multi-cluster application scenarios are as follows:
1) The product itself requires multi-cluster capabilities. The product's control components must be deployed in a Kubernetes cluster, and the product must also provide Kubernetes clusters to users. For fault isolation, stability, and security, the control components and the business workloads of container services must be deployed in different clusters.
2) Users also want to create multiple Kubernetes clusters for different businesses so that resources and faults are isolated.
3) A user may need multiple types of clusters. Take IoT edge computing as an example: it requires a customized edge Kubernetes cluster. Building this edge cluster as a common independent Kubernetes cluster wastes resources, and independent clusters also increase O&M costs for users.
The difficulties in operating and maintaining Kubernetes clusters fall into two parts.
Difficulty 1) Maintaining the Control Plane of Kubernetes Clusters
- How do I enable users to add a new Kubernetes cluster quickly?
- How do I upgrade the versions of multiple Kubernetes clusters? Do I need to upgrade the clusters one by one when the community detects major CVE issues?
- How do I automatically fix faults that occur when multiple Kubernetes clusters are running?
- How do I maintain etcd in a cluster, including performing upgrades, backup, restoration, and node migration?
Difficulty 2) Maintaining Worker Nodes
- How do I quickly scale Worker nodes up or down? At the same time, the version and configuration of the on-host software (such as Docker and Kubelet that cannot be managed by Kubernetes) on each node must be consistent with those on other nodes.
- How do I upgrade the on-host software on several Worker nodes? How do I implement a phased release of software packages on Worker nodes?
- How does Kubernetes automatically resolve possible on-host software faults on several Worker nodes? Can we automate the process of handling a panic that occurs in Docker or Kubelet?
As a complex automated O&M system, Kubernetes supports the release, upgrade, and lifecycle management of upper-layer businesses. In the industry, however, the O&M of Kubernetes itself is usually performed with workflow tools such as Ansible or Argo. This process is not highly automated and requires O&M personnel to have professional Kubernetes knowledge. When multiple Kubernetes clusters must be operated, O&M costs increase linearly even under ideal conditions, and they increase even more in Apsara Stack scenarios.
Alibaba Group encountered the challenges of managing many Kubernetes clusters a long time ago. Therefore, we abandoned the traditional workflow model and explored another solution: Use Kubernetes to manage Kubernetes clusters. For more information, see the CNCF article Demystifying Kubernetes as a Service — How Alibaba Cloud Manages 10,000s of Kubernetes Clusters, which introduced Alibaba Cloud’s exploration and experience in managing large-scale Kubernetes clusters.
Of course, an Apsara Stack scenario is quite different from Alibaba Group’s internal scenario. Inside Alibaba Group, the crucial reason for using Kube-On-Kube (KOK) is its scale effect: one Meta Kubernetes cluster can manage thousands of business Kubernetes clusters. In Apsara Stack scenarios, the scale effect is small. Apsara Stack instead uses KOK to automate the O&M of Kubernetes clusters and to stay compatible with various types of Kubernetes clusters, which improves stability and enriches application scenarios.
Declarative Kubernetes O&M Philosophy: Use Kubernetes to Manage Kubernetes Clusters
The declarative API of Kubernetes changes the traditional procedural O&M mode. It corresponds to the final-state-oriented O&M mode: Users define their desired states in the Spec, and the Kubernetes controller performs a series of operations to help users achieve the desired state. If the requirements are not met, the controller keeps trying.
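The final-state-oriented mode described above can be illustrated with a minimal reconcile loop. This is only a sketch, not code from Kubernetes or from the system described in this article: the `ClusterSpec` and `ClusterStatus` types, their fields, and the version strings are hypothetical, and each "action" is a stand-in for a real O&M operation.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    """Desired state declared by the user (hypothetical fields)."""
    version: str
    replicas: int


@dataclass
class ClusterStatus:
    """Observed state of the system (hypothetical fields)."""
    version: str
    replicas: int


def reconcile(spec: ClusterSpec, status: ClusterStatus) -> bool:
    """Move the observed state one step toward the desired state.

    Returns True when the status already matches the spec.
    """
    if status.version != spec.version:
        status.version = spec.version   # stand-in for an upgrade action
        return False
    if status.replicas < spec.replicas:
        status.replicas += 1            # stand-in for starting one replica
        return False
    if status.replicas > spec.replicas:
        status.replicas -= 1            # stand-in for stopping one replica
        return False
    return True


def run_until_converged(spec: ClusterSpec, status: ClusterStatus, max_rounds: int = 100) -> int:
    """Keep reconciling until the final state is reached; return the rounds used."""
    for rounds in range(1, max_rounds + 1):
        if reconcile(spec, status):
            return rounds
    raise RuntimeError("final state not reached within max_rounds")
```

The user only states the target in the spec; the loop decides which step to take next and repeats until nothing is left to do, which is the "keeps trying" behavior of a controller.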
For Kubernetes native resources such as Deployment, the Kubernetes controller is responsible for maintaining the final states. For user-defined resources, such as a cluster, Kubernetes provides a powerful and easy-to-use CRD + Operator mechanism. Customize final-state-oriented O&M tasks in a few simple steps:
1) Define your own resource type (CRD) and implement your own Operator, which contains a custom Controller.
2) Submit a CR file in YAML or JSON format.
3) The Operator detects the CR’s change, and the Controller starts to execute the corresponding O&M logic.
4) During the running process, if the final state does not meet the requirements, the Operator monitors the change and performs corresponding recovery operations.
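The four steps above can be sketched with an in-memory stand-in for the API server. Everything here is an assumption made for illustration: `FakeApiServer`, `DemoOperator`, the `example.com/v1` group, and the `Cluster` kind with a single `version` field are hypothetical and do not correspond to a real CRD or client library.

```python
import json


class FakeApiServer:
    """In-memory stand-in for the Kubernetes API server (illustration only)."""

    def __init__(self):
        self.objects = {}    # CR name -> CR body
        self.watchers = []   # callbacks invoked on every change

    def watch(self, callback):
        self.watchers.append(callback)

    def apply(self, cr):
        self.objects[cr["metadata"]["name"]] = cr
        for callback in self.watchers:
            callback(cr)


class DemoOperator:
    """Toy Operator whose controller makes the running state match each CR."""

    def __init__(self, api):
        self.running = {}    # cluster name -> version actually running
        api.watch(self.on_change)

    def on_change(self, cr):
        name = cr["metadata"]["name"]
        desired = cr["spec"]["version"]
        if self.running.get(name) != desired:
            self.running[name] = desired   # stand-in for the real O&M logic

    def resync(self, api):
        """Periodic check that repairs drift between the CRs and reality."""
        for cr in api.objects.values():
            self.on_change(cr)


# Step 2: a CR submitted in JSON format (hypothetical schema).
CR_JSON = """
{
  "apiVersion": "example.com/v1",
  "kind": "Cluster",
  "metadata": {"name": "biz-a"},
  "spec": {"version": "1.20"}
}
"""
```

Calling `api.apply(json.loads(CR_JSON))` triggers the controller through the watch callback (step 3), and `resync` models the recovery path of step 4 when the running state drifts from the CR.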
The Operator is one of the best practices that use code to perform O&M on applications. Of course, it is just a framework that eliminates some repetitive work, such as event listening and RESTful API listening. However, the core O&M logic still needs to be written case by case.
Cloud-native KOK Architecture
KOK is not a new concept. There are many excellent solutions in the industry, including those from Ant Financial and from community projects.
However, these solutions depend heavily on Alibaba Cloud infrastructure. Apsara Stack has the following characteristics that we need to consider:
- Users prefer products that are lightweight enough, and they generally do not accept the costs of exclusive management nodes.
- The solution must not depend on the differentiated infrastructures of the public cloud or of Alibaba Group’s internal environment.
- Cloud-native architecture should be adopted.
After considering these three factors, we have designed a more general KOK solution as shown below.
- Meta Cluster is the lower-layer Kube of the Kube-On-Kube.
- Production Cluster is the business cluster, the upper-layer Kube of the Kube-On-Kube.
- etcd cluster is created and maintained by the etcd operator running in the meta cluster. Each business cluster occupies an etcd cluster exclusively, or multiple business clusters share an etcd cluster.
- PC-master-pod is the control pod of the business clusters. It is represented by three pods: Apiserver, Scheduler, and Controller Manager. These pods are maintained by the Cluster Operator running in the Meta Cluster.
- PC-nodes is the business cluster node. It is initialized, added to a business cluster, and maintained by a Machine Operator.
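The relationships in the list above can be written down as a small data model. This is a sketch for orientation only; the class and field names are hypothetical and are not the resource types used by the actual Operators.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProductionCluster:
    """A business cluster whose control pods run inside the meta cluster."""
    name: str
    etcd: str   # name of the etcd cluster that stores its data
    master_pods: tuple = ("apiserver", "scheduler", "controller-manager")
    nodes: list = field(default_factory=list)   # PC-nodes joined by the Machine Operator


@dataclass
class MetaCluster:
    """The lower-layer cluster that hosts the Operators and all PC-master-pods."""
    name: str
    production_clusters: list = field(default_factory=list)

    def etcd_usage(self):
        """Map each etcd cluster to the business clusters that use it."""
        usage = defaultdict(list)
        for pc in self.production_clusters:
            usage[pc.etcd].append(pc.name)
        return dict(usage)
```

An etcd cluster that maps to one business cluster is used exclusively; one that maps to several is shared.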
The etcd Operator creates, destroys, upgrades, and recovers etcd clusters. It also monitors the status of etcd clusters, including the cluster health status, member health status, and storage data volume.
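As an illustration of the monitoring side, the following sketch combines member health and storage volume into one cluster status. The majority rule reflects how etcd's Raft consensus works (a cluster needs more than half of its members to make progress); the three status labels, the function names, and the idea of comparing the data size against a caller-supplied quota are assumptions for this example, not the actual logic of the etcd Operator.

```python
def quorum(member_count: int) -> int:
    """Smallest number of members that forms a majority."""
    return member_count // 2 + 1


def cluster_health(members: dict, db_size_bytes: int, quota_bytes: int) -> str:
    """Summarize etcd cluster health.

    members maps a member name to True (healthy) or False (unhealthy).
    """
    healthy = sum(1 for ok in members.values() if ok)
    if healthy < quorum(len(members)):
        return "unavailable"   # no majority, so writes cannot be committed
    if healthy < len(members) or db_size_bytes >= quota_bytes:
        return "degraded"      # still serving, but needs repair or compaction
    return "healthy"
```

For example, a three-member cluster with one failed member still has a majority of two, so it is reported as "degraded" rather than "unavailable".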
The SRE team of the cloud-native application platform of Alibaba Cloud improved the open-source version of etcd Operator and enhanced its O&M capabilities and stability. This etcd Operator is responsible for the O&M and management of a large number of etcd clusters in Alibaba Cloud, and it has proven to be stable.
The Cluster Operator is responsible for creating and maintaining the Kubernetes management and control components, including Apiserver, Controller Manager, and Scheduler. It is also responsible for generating corresponding certificates and kubeconfig.
We worked with the PaaS engine and Apsara Stack product team from Ant Financial Group to create the Cluster Operator, which provides features such as custom rendering, version tracking, and dynamic addition of supported versions.
The Kubernetes management components of a business cluster are deployed in a meta cluster. From a meta cluster perspective, these components are common resources, including Deployment, Secret, Pod, and PVC. Therefore, business clusters do not have the concept of Master nodes:
- kube-apiserver: consists of one Deployment and one Service. Because the kube-apiserver is stateless, a Deployment is sufficient. The Apiserver must connect to the etcd cluster started by the etcd Operator. The Service exposes the Apiserver to other components of the business cluster and to external components. If a LoadBalancer Service is available in your environment, we recommend using it to expose the Apiserver; otherwise, the Apiserver is exposed through a NodePort Service.
- kube-controller-manager: one Deployment, stateless application
- kube-scheduler: one Deployment, stateless application
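The layout above can be sketched as a function that builds the manifests a meta cluster would hold for one business cluster. The `apps/v1` Deployment and `v1` Service shapes and the `--etcd-servers` and `--kubeconfig` flags are standard Kubernetes, but the naming scheme, image paths, kubeconfig mount path, and function names are hypothetical, and a real control plane needs many more flags (certificates, service CIDR, and so on). This is not the Cluster Operator's actual rendering logic.

```python
def deployment(name, binary, image, args, replicas=1):
    """Build a minimal apps/v1 Deployment for one stateless component."""
    labels = {"app": name}
    container = {"name": binary, "image": image, "command": [binary, *args]}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


def apiserver_service(name, load_balancer_available, port=6443):
    """Expose the Apiserver: LoadBalancer when available, NodePort otherwise."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer" if load_balancer_available else "NodePort",
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


def control_plane(cluster, etcd_endpoints, image_repo, version, load_balancer_available):
    """Return the manifests for one business cluster's control components."""
    def image(binary):
        return f"{image_repo}/{binary}:{version}"   # hypothetical image layout

    apiserver = f"{cluster}-kube-apiserver"
    kubeconfig = "--kubeconfig=/etc/kubernetes/kubeconfig"   # hypothetical mount path
    return [
        deployment(apiserver, "kube-apiserver", image("kube-apiserver"),
                   ["--etcd-servers=" + ",".join(etcd_endpoints)]),
        apiserver_service(apiserver, load_balancer_available),
        deployment(f"{cluster}-kube-controller-manager", "kube-controller-manager",
                   image("kube-controller-manager"), [kubeconfig]),
        deployment(f"{cluster}-kube-scheduler", "kube-scheduler",
                   image("kube-scheduler"), [kubeconfig]),
    ]
```

Because every component is an ordinary Deployment or Service in the meta cluster, the business cluster needs no dedicated Master nodes.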
However, Kubernetes is not available if only the preceding three components are deployed. The following requirements must be met:
1) In addition to etcd and the three major components, coredns, kube-proxy, and other components must be deployed before the Kubernetes cluster can serve workloads.
2) Some components must be deployed on the same meta cluster as etcd and the three major components. For example, a Machine Operator is responsible for starting nodes, and it must be ready to run before nodes are available in the business cluster. Therefore, it cannot be deployed in a business cluster.
3) Components need to be upgraded.
To meet the need for scalability, we designed the Addons plug-and-play function, which allows importing all Addons components through only one command. In addition, Addons supports dynamic rendering and allows customizing Addons configuration items. The details will not be elaborated here.
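Since the article does not elaborate on the Addons mechanism, the following is only a guess at what dynamic rendering with customizable configuration items could look like. The template contents, the configuration keys, and the `render_addons` function are hypothetical; they are not the real configuration formats of coredns or kube-proxy, nor the actual Addons implementation.

```python
from string import Template

# Hypothetical addon templates; real addons would be full Kubernetes manifests.
ADDON_TEMPLATES = {
    "coredns": Template("replicas: $replicas\nclusterDomain: $cluster_domain\n"),
    "kube-proxy": Template("mode: $proxy_mode\n"),
}

# Default configuration items that a user may override.
DEFAULTS = {
    "replicas": "2",
    "cluster_domain": "cluster.local",
    "proxy_mode": "iptables",
}


def render_addons(overrides=None):
    """Render every addon template with the defaults plus user overrides."""
    values = {**DEFAULTS, **(overrides or {})}
    return {name: tpl.substitute(values) for name, tpl in ADDON_TEMPLATES.items()}
```

A single call such as `render_addons({"proxy_mode": "ipvs"})` renders all addons at once while changing only the item the user cares about.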
The Machine Operator performs the necessary initialization on a node; installs node components such as Docker, Kubernetes, and NVIDIA components; maintains their final state; and adds the node to a business cluster.
We use the KubeNode component maintained by the Serverless node management team of the cloud-native application platform of Alibaba Cloud. This Operator brings nodes online and offline within Alibaba Group and implements a final-state-oriented O&M framework. O&M CRDs can be customized for different architectures (Arch) or operating systems (OSs), which suits the constantly changing Apsara Stack environments.
In short, KubeNode provides a final-state-oriented O&M framework, which consists of two concepts: the Component and the Machine.
1) You provide the O&M script based on the template to generate the Component CR.
2) If you want to bring online a node, generate a Machine CR, which specifies the components that need to be deployed.
3) When KubeNode detects the Machine CR, it performs O&M operations on the corresponding node.
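The Component and Machine concepts can be sketched as follows. KubeNode's real CRD schemas are not shown in this article, so the `Component` and `Machine` fields, the script paths, and the `reconcile_machine` function are all hypothetical; recording the script name stands in for actually running it on the node.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """An O&M script for one component on one architecture and OS (hypothetical schema)."""
    name: str
    arch: str
    os: str
    install_script: str


@dataclass
class Machine:
    """A node plus the components it should end up with (hypothetical schema)."""
    hostname: str
    arch: str
    os: str
    components: list                                # names of the desired components
    installed: dict = field(default_factory=dict)   # name -> script that was run


def reconcile_machine(machine, catalog):
    """Install each missing component with the script that matches the node.

    Returns the names installed in this round.
    """
    done = []
    for name in machine.components:
        if name in machine.installed:
            continue   # already in the final state
        key = (name, machine.arch, machine.os)
        match = next((c for c in catalog if (c.name, c.arch, c.os) == key), None)
        if match is None:
            raise LookupError(f"no component {name!r} for {machine.arch}/{machine.os}")
        machine.installed[name] = match.install_script   # stand-in for running the script
        done.append(name)
    return done
```

Supporting a new architecture or OS then means adding Component entries to the catalog rather than changing `reconcile_machine`.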
In theory, this design can be extended to different architectures (Arch) or OSs without modifying the Operator source code, which makes it highly flexible. We are also exploring how to integrate IaaS providers to ultimately achieve the goal of RunAnyWhere.
Comparison of the Costs for Multi-Cluster Solutions
After the automation tools (the cloud-native Operators) are connected to the preceding processes, a cluster’s production time can be reduced to minutes.
The following table compares the costs of a tiled multi-cluster solution (direct deployment of multiple sets) and a KOK multi-cluster solution.
T is the deployment time of a single Kubernetes cluster, t is the deployment time of a single business cluster, K is the number of clusters, G is the number of sites, U is the time of meta cluster upgrade, u is the time of business cluster upgrade, and P is the number of upgrades.
According to our practical experience, T and U are about 1 hour, and t and u are about 10 minutes under normal circumstances. We predict that:
- The time for delivering multiple clusters (three clusters) will decrease from more than three hours to less than one hour.
- The time for upgrading a cluster will decrease from more than one hour to 10 minutes.
- The time for creating a new cluster will decrease from more than two hours to 10 minutes.
The tiled multi-cluster solution increases O&M complexity linearly. However, the KOK multi-cluster solution treats the Kubernetes cluster as a Kubernetes resource. By using the powerful CRD and Operator capabilities of Kubernetes, the O&M of Kubernetes clusters is upgraded from the traditional procedural type to the declarative type. In this way, the O&M complexity of Kubernetes clusters is significantly reduced.
Building on years of O&M experience in Alibaba Group, the multi-cluster design described in this article adopts a cloud-native architecture to implement RunAnyWhere without relying on differentiated infrastructures. Given only common IaaS, it provides easy-to-use, stable, and lightweight Kubernetes multi-cluster capabilities.