How Kubernetes and Cloud Native Are Working Out in Alibaba’s Systems
By Zhang Zhen (Shouchen), Senior Technical Expert for Alibaba Cloud’s Cloud-native Application Platform.
By and large, Kubernetes has become the de facto standard for container orchestration and for nearly everything else that’s cloud native. Kubernetes has been in use at Alibaba Group for several years, but the way we use it has changed a lot, going through four distinct stages:
- R&D and exploration: In the second half of 2017, Alibaba Group began to experiment with the Kubernetes API in work on our own in-house platform, and to adapt our application delivery system to Kubernetes.
- On-premises phased release: In the second half of 2018, Alibaba Group and sister company Ant Financial jointly invested in the research and development of the Kubernetes technology ecosystem to replace Alibaba’s in-house platform with one that uses Kubernetes. And with the success of a small-scale implementation, services using Kubernetes provided a certain level of support during 2018’s Double 11 shopping promotion.
- Off-premises phased release: In early 2019, the whole of Alibaba Group, including all of its e-commerce platforms, began a comprehensive cloud-based transformation. By redesigning the Kubernetes implementation plan, Alibaba Group adapted to the cloud environment and transformed outdated O&M practices. Alibaba also completed the small-scale verification of off-premises data centers before the midyear 6.18 shopping festival, which was bigger than in any previous year.
- Large-scale implementation: After 2019’s 6.18 shopping event, Alibaba Group began to ramp up the implementation of Kubernetes throughout its systems. Before the 2019 Double 11, the group had already achieved the goal of running all core applications on Kubernetes, ready to meet the challenge of Double 11.
Throughout the process, however, one problem still lingers on the minds of the architects. In such a massive and complex business as Alibaba Group, a large number of previous O&M methods remain along with the current O&M systems that support these methods. In this context, what methods are required for implementing Kubernetes, and what should be compromised or changed moving forward?
With all of this in mind, in this article I’ll share, from the inside, Alibaba’s thoughts on these issues. In short, introducing Kubernetes was never the ultimate goal. Rather, we implemented Kubernetes to drive the transformation of our business: with the capabilities of Kubernetes, we can resolve the deep-seated problems in our old O&M system, unleash the elasticity of the cloud, and accelerate the delivery of business applications. However, whether to keep our old O&M systems, and what changes we should make moving forward, still leave a lot to discuss.
Final State-oriented Transformation
In Alibaba’s original O&M system, the platform-as-a-service (PaaS) system makes application changes by creating operation tickets, initiating workflows, and then making serial changes to the container platform.
When an application is released, the PaaS system searches the database for all the containers related to the application, and then asks the container platform to modify the container image of each container. Each change is itself a workflow that involves pulling images, stopping old containers, and creating new containers. If an error or timeout occurs during such a workflow, the PaaS system has to retry the failed operation. To ensure that a ticket can be closed in a timely manner, the retry is generally performed only a few times. If all these retries fail, manual processing is required.
When an application is scaled in, PaaS performs deletion operations according to the container list that is specified based on the input from the O&M personnel. If a container fails to be deleted due to a host exception, or the deletion times out, then PaaS has to retry the operation repeatedly. So, to ensure the closing of the ticket, the container is considered to be deleted after a certain number of retries. However, if the host recovers later, then, in reality, the deleted container may still be running.
Therefore, we can conclude that, with the older system, process-oriented container changes suffer from the following problems that cannot be fully eliminated:
1. A single failed change leads to final failure. For example, if a container image change fails, the PaaS system cannot guarantee the consistency of the final container image. And, if a container fails to be deleted, it is impossible to ensure that the container has been fully deleted. In both examples, the inconsistent containers must be found and handled through inspection tasks. However, inspection tasks that run only rarely can guarantee neither correctness nor timeliness.
2. Multiple changes may conflict with each other. For example, application release and application scale-out must lock each other out. Otherwise, the image of a newly scaled-out container may not be updated. Once changes are locked, however, the efficiency of changes is greatly reduced.
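The two problems above are what a final state-oriented (declarative) model is meant to remove. The following is a minimal sketch of that idea, assuming a hypothetical in-memory view of the platform rather than any real Kubernetes or Alibaba API. The names `DesiredState` and `reconcile` are invented for illustration. The function compares a declared final state with the observed containers and returns the remaining work. It is recomputed on every pass, so a failed action is picked up again on the next pass, and a release and a scale-out can both edit the desired state without locking each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesiredState:
    """Hypothetical final state declared for one application."""
    image: str
    replicas: int


def reconcile(desired, observed):
    """Return the actions that move `observed` toward `desired`.

    `observed` maps container name -> image currently running.
    The function keeps no state between calls, so it can be run again
    after any partial failure and will return whatever work is left.
    """
    up_to_date = sorted(n for n, img in observed.items() if img == desired.image)
    stale = sorted(n for n, img in observed.items() if img != desired.image)

    actions = []
    # Containers running another image are replaced
    # (all at once here, for brevity; a real rollout would batch this).
    for name in stale:
        actions.append(("delete", name))
    # Surplus up-to-date containers are removed (scale-in).
    for name in up_to_date[desired.replicas:]:
        actions.append(("delete", name))
    # Missing containers are created from the desired image (scale-out).
    missing = desired.replicas - len(up_to_date)
    for _ in range(max(missing, 0)):
        actions.append(("create", desired.image))
    return actions


# A release (image v1 -> v2) and a scale-out (2 -> 3) expressed as one final state.
plan = reconcile(DesiredState("app:v2", 3), {"c1": "app:v1", "c2": "app:v2"})
print(plan)
```

Once the observed state matches the desired state, `reconcile` returns an empty list, which is the signal that the system has converged.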
Self-healing Capability Transformation
In Alibaba’s old O&M system, the container platform only produced resources. Application launch and service discovery were performed by the PaaS system after containers were launched. This layered method gave the PaaS system the greatest freedom, and it also helped Alibaba’s initial container ecosystem flourish after containerization. However, this method has a serious problem: the container platform cannot trigger container scaling on its own, but has to build a complex linkage with PaaS, and the upper-layer PaaS has to do a lot of repetitive work. This prevents the container platform from recovering efficiently by itself when a host fails or restarts, or when processes in a container become abnormal or stuck. It also makes elastic scaling extremely complex.
With Kubernetes, container commands and lifecycle hooks can be used to build PaaS processes for application launch and application launch status check into a pod. In addition, by creating a service object, you can then associate containers with corresponding service discovery mechanisms to unify the lifecycles of containers, applications, and services. The container platform not only provides production resources, but also delivers services that can be directly used by the business. This greatly simplifies the implementation of fault recovery and automatic scaling after cloud migration, taking full advantage of the cloud’s elasticity.
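As a rough illustration of the paragraph above, the manifests below are written as plain Python dictionaries. The field names (`lifecycle.postStart`, `readinessProbe`, `spec.selector`) are standard Kubernetes API fields. The application name, image, start script, port, and health path are hypothetical, as is the `selects` helper, which only mimics how a Service selector matches pod labels.

```python
APP_LABELS = {"app": "demo-app"}  # hypothetical application label

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo-app-0", "labels": dict(APP_LABELS)},
    "spec": {
        "containers": [
            {
                "name": "main",
                "image": "registry.example.com/demo-app:v1",  # hypothetical image
                "ports": [{"containerPort": 8080}],
                # Application launch step that the PaaS workflow used to drive.
                "lifecycle": {
                    "postStart": {
                        "exec": {"command": ["/bin/sh", "-c", "/opt/app/start.sh"]}
                    }
                },
                # Launch status check: the pod is marked ready only once this passes.
                "readinessProbe": {
                    "httpGet": {"path": "/healthz", "port": 8080},
                    "periodSeconds": 5,
                },
            }
        ]
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "demo-app"},
    "spec": {
        # Service discovery is tied to the pod through its labels.
        "selector": dict(APP_LABELS),
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}


def selects(service_manifest, pod_manifest):
    """Return True if every selector label of the Service is set on the pod."""
    selector = service_manifest["spec"]["selector"]
    labels = pod_manifest["metadata"].get("labels", {})
    return all(labels.get(key) == value for key, value in selector.items())


print(selects(service, pod))
```

With launch, health checking, and discovery all declared on the pod and Service, the platform can replace a failed pod without a separate PaaS workflow.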
In addition, when the host fails, the PaaS system needs to scale out the application before deleting the container on the host. However, in large-scale clusters, it is difficult to scale out applications in most cases. The application resource quota may be insufficient, or idle resources in the cluster that meet the application scheduling restrictions may be insufficient. Containers on the host cannot be evicted if scale-out is impossible. As a result, the abnormal host cannot be fixed, and over time, the entire cluster can contain a number of faulty servers that can neither be fixed nor removed.
In Kubernetes, the handling of faulty servers is much simpler and more direct. Instead of first scaling out applications, the container on the faulty server is deleted directly, and the workload controller performs scale-out only after the deletion is completed. At first glance, this solution seems daring. When Kubernetes was being implemented, many PaaS engineers strongly objected to this method, because they believed it could seriously affect business stability. In fact, most core business applications reserve a certain amount of redundant capacity to implement global traffic switching or to deal with business traffic bursts. In other words, temporarily deleting a certain number of containers does not lead to insufficient capacity.
Instead, the key is to determine the available capacity of an application, which is a more difficult problem. However, self-healing scenarios do not require an accurate capacity assessment; a pessimistic estimate that can still trigger self-recovery is enough. In Kubernetes, you can use PodDisruptionBudget to quantitatively describe how many pods of an application can be migrated. For example, you can set the number or proportion of pods that can be evicted simultaneously for the application. You can base this value on the proportion of each batch at release time. If an application is normally released in 10 batches, you can set PodDisruptionBudget to 10%. With a proportion, if an application has 10 or fewer pods, Kubernetes still considers that one pod can be evicted. If an application cannot tolerate the eviction of even a single pod, it needs to be transformed before it can enjoy the benefits of migrating to the cloud. Generally, an application can allow pod migration by modifying its own architecture or by using operators to automate its O&M operations.
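The 10% rule can be made concrete with a small calculation. The sketch below assumes the budget is expressed through the `maxUnavailable` field of a PodDisruptionBudget and that a percentage is rounded up to a whole number of pods, which is how the Kubernetes documentation describes it. The helper names and the application label are hypothetical, and the arithmetic is a simplification of what the disruption controller tracks.

```python
# A PodDisruptionBudget manifest as a plain dictionary; the label is hypothetical.
pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "demo-app"},
    "spec": {
        "maxUnavailable": "10%",  # one release batch out of ten
        "selector": {"matchLabels": {"app": "demo-app"}},
    },
}


def max_unavailable(replicas, percent):
    """Scale a percentage to a pod count, rounding up (integer arithmetic)."""
    return (replicas * percent + 99) // 100


def allowed_disruptions(replicas, healthy, percent):
    """Number of pods that may be evicted right now under the budget."""
    desired_healthy = replicas - max_unavailable(replicas, percent)
    return max(healthy - desired_healthy, 0)


for replicas in (7, 10, 30):
    print(replicas, allowed_disruptions(replicas, replicas, 10))
```

Because of the rounding up, an application with 7 healthy pods still allows one eviction, while the same application with one pod already unhealthy allows none until it recovers.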
Immutable Infrastructure Transformation
The emergence of Docker provided a unified application delivery pattern. The binary code, configurations, and dependencies of applications are packaged into an image during the building process, and application changes are made by using a new image to create a container and deleting the old container. The major difference between Docker’s delivery method and traditional package- or script-based delivery is that Docker forces the container to be immutable. To change the container, you must create a new container, with each new container being created from the same image of the application. This ensures consistency and avoids configuration drift or snowflake servers.
Kubernetes further strengthens the concept of immutable infrastructure. In the default rolling upgrade process, it neither changes containers nor changes pods. Each release is done by creating a new pod and deleting an old pod, which not only ensures the consistency of application images, but it also ensures that data volumes, resource specifications, and system parameter configurations remain consistent with the spec of the application template. In addition, many applications have complex structures. One pod may contain components that are individually developed by different teams. For example, an application may include business-related application servers, log collection processes developed by the infrastructure team, and even third-party middleware components. To release these processes and components separately, you cannot place them in the same application image. Therefore, Kubernetes provides the support for multi-container pods, allowing multiple containers to be orchestrated in a pod. In this way, you can publish a single component simply by modifying the container image.
However, Alibaba’s traditional containers are rich containers. That is, application servers and log collection processes are all deployed in the same large system container. As a result, the resource consumption of components such as log collection cannot be individually controlled, and independent upgrade is impossible. Therefore, during this cloud migration at Alibaba, all components in system containers, except business applications, were split into independent sidecar containers. We call this process a lightweight container transformation. After the transformation, a pod contains a master container that runs business, an O&M container that runs various infrastructure agents, and a sidecar that runs service meshes. After lightweight containerization, the master container can run a business service with a relatively low overhead, which facilitates serverless transformation.
However, the default rolling upgrade process of Kubernetes implements the immutable infrastructure concept too rigidly, which seriously limits its support for multi-container pods. Although multiple containers can be deployed in a single pod, releasing one container in the pod does not rebuild only that container. Instead, the entire pod is deleted, rescheduled, and rebuilt. This means that, to upgrade the log collection component of the infrastructure, all other components, especially application servers, are deleted and restarted, which affects normal business operation. Therefore, changes to multiple components are still not decoupled.
For example, if a pod has a local caching component and the caching process is restarted every time the application is released, the cache hit rate drops sharply during the release, which affects performance and even the user experience. In addition, if component upgrades by the infrastructure, middleware, and other teams are tied to the release of the application, it is very difficult to iterate on technical upgrades. Assume that the middleware team has launched a new version of the service mesh. If the mesh components can be updated only when each application is released, the technical upgrade of the middleware is severely delayed.
Therefore, in contrast to pod-level immutability, we believe that adhering to immutability at the container level makes the most of the technical advantages of multi-container pods in Kubernetes. To this end, we have developed the ability to modify some of the containers in a pod in place when releasing an application. In particular, we have built a workload controller that supports in-place container upgrade, and have replaced the default Deployment and StatefulSet controllers of Kubernetes with the new workload controller to handle most internal workloads. Additionally, SidecarSet has been built to support the upgrade of sidecar containers across applications, facilitating the upgrade of infrastructure and middleware components. In-place upgrade also brings additional advantages, such as a deterministic cluster distribution and faster image downloads. These features have been made open source through the OpenKruise project. As for the name, Kruise sounds like "cruise", and the letter "K" stands for Kubernetes. Altogether, OpenKruise represents the automatic cruising of applications on Kubernetes, drawing on Alibaba’s years of management experience in application deployment and the best practices of Alibaba’s cloud-native economy. Currently, OpenKruise is planning to release more controllers to support more scenarios and functions, such as rich release policies, canary release, blue-green release, and batch release.
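The decision at the heart of container-level immutability can be sketched as follows. This is not OpenKruise’s actual implementation. It is a simplified illustration that assumes an in-place upgrade is possible only when two pod specs differ in nothing but container images, and the function names and sample specs are hypothetical.

```python
import copy


def _without_images(spec):
    """Copy a pod spec with every container's image field removed."""
    stripped = copy.deepcopy(spec)
    for container in stripped.get("containers", []):
        container.pop("image", None)
    return stripped


def can_update_in_place(old_spec, new_spec):
    """True if the specs are identical apart from container images."""
    return _without_images(old_spec) == _without_images(new_spec)


def containers_to_restart(old_spec, new_spec):
    """Names of the containers whose image changed, in spec order."""
    old_images = {c["name"]: c.get("image") for c in old_spec["containers"]}
    return [
        c["name"]
        for c in new_spec["containers"]
        if old_images.get(c["name"]) != c.get("image")
    ]


old = {
    "containers": [
        {"name": "main", "image": "demo-app:v1", "resources": {"cpu": "2"}},
        {"name": "log-agent", "image": "log-agent:v1", "resources": {"cpu": "0.1"}},
    ]
}

# Only the sidecar image changes: the main container keeps running.
sidecar_upgrade = copy.deepcopy(old)
sidecar_upgrade["containers"][1]["image"] = "log-agent:v2"

# A resource change touches more than the image: the pod must be recreated.
resize = copy.deepcopy(old)
resize["containers"][0]["resources"]["cpu"] = "4"

print(can_update_in_place(old, sidecar_upgrade), containers_to_restart(old, sidecar_upgrade))
print(can_update_in_place(old, resize))
```

In the sidecar upgrade case only `log-agent` is restarted, so the application server and any local cache in the main container are left untouched, which is the decoupling that the text argues for.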
This year, we implemented Kubernetes at a large scale and withstood the real-world test of Double 11. For scenarios with a large number of existing applications, like Alibaba’s, there is no shortcut for implementing Kubernetes. We abandoned the idea of a rapid, large-scale rollout and chose not to make the implementation compatible with outdated O&M methods. Instead, we strove to lay a solid foundation in order to fully extract the value of cloud native. In the future, we will continue to promote the cloud-native transformation of more applications, especially stateful applications, to make their deployment and O&M more efficient. We will also promote the cloud-native transformation of the entire application delivery system to make application delivery more efficient and standardized.