Enabling Rolling Updates for Applications in Kubernetes with Zero Downtime
By Zibai, an Alibaba Cloud Development Engineer and Xiheng, an Alibaba Cloud Technical Expert
In Kubernetes clusters, business applications usually provide external services through a Deployment combined with a LoadBalancer Service. Figure 1 shows a typical deployment architecture. This architecture is simple and convenient for deployment and O&M, but downtime may occur during application updates or upgrades and cause online problems. Today, let’s take a close look at why this architecture may cause downtime during application updates and how to prevent it.
Reasons for Downtime
During a rolling update of a Deployment, a new pod is created first. The old pod is deleted only after the new pod starts to run.
Reason for Downtime: After the pod starts to run, it is added to endpoints. After detecting the change in the endpoints, Container Service for Kubernetes (ACK) adds the corresponding node to the backend of the Server Load Balancer (SLB) instance. At this time, the request is forwarded from the SLB instance to the pod. However, the application code in the pod has not yet been initialized, so the pod cannot handle the request. This results in downtime, as shown in Figure 2.
Solution: Configure readinessProbe for the pod, so that the corresponding node is added to the backend of the SLB instance only after the application code in the pod is initialized.
When the old pod is deleted, the status of multiple objects, such as the endpoint, IPVS, iptables, and SLB instance, must be synchronized. These sync operations are performed asynchronously. Figure 3 shows the overall synchronization process.
- Change the Pod Status: Set the pod status to Terminating and remove it from the endpoints of all services. At this time, the pod stops receiving new traffic, but the containers running within the pod will not be affected.
- Execute the preStop Hook: The preStop hook is triggered when the pod is deleted. preStop supports bash scripts and TCP or HTTP requests.
- Send the SIGTERM Signal: Send the SIGTERM signal to containers in the pod.
- Specify the Wait Time: The terminationGracePeriodSeconds field is used to control the wait time. The default value is 30 seconds. This step is performed simultaneously with the preStop hook, so the terminationGracePeriodSeconds value must be higher than the execution time of preStop. Otherwise, the pod may be killed before preStop is completely executed.
- Send the SIGKILL Signal: After the specified wait time, send the SIGKILL signal to containers in the pod to delete the pod.
Reason for Downtime: The preceding steps 1, 2, 3, and 4 are performed simultaneously, so it is possible that the pod has not been removed from endpoints after it received the SIGTERM signal and stopped working. At this time, the request is forwarded from the SLB instance to the pod, but the pod has already stopped working. Therefore, downtime may occur, as shown in Figure 4.
Solution: Configure the preStop hook for the pod, so that the pod sleeps for some time before its containers receive SIGTERM, instead of stopping work immediately. This ensures that traffic still forwarded from the SLB instance during this window can be processed by the pod.
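A minimal sketch of such a preStop hook is shown here. The container name, image, and the 10-second and 40-second values are illustrative assumptions, not values from the original article, and the image is assumed to ship `sh` and `sleep`.

```yaml
# Pod spec fragment (hypothetical names and durations).
spec:
  # Must be higher than the preStop duration, or the pod may be killed mid-hook.
  terminationGracePeriodSeconds: 40
  containers:
  - name: app                  # hypothetical container name
    image: example/app:1.0     # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Keep serving while endpoints, iptables/IPVS rules,
          # and the SLB backend are being updated.
          command: ["sh", "-c", "sleep 10"]
```

The sleep duration is a trade-off: longer values cover slower backend removal but slow down each rolling update.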
iptables or IPVS
Reason for Downtime: When the pod status changes to Terminating, the pod is removed from the endpoints of all services, and kube-proxy cleans up the corresponding iptables or IPVS rules. After detecting the change in the endpoints, ACK calls slbopenapi to remove the backend of the endpoints. This operation takes several seconds. The two operations are performed simultaneously, so it is possible that the iptables or IPVS rules on the node have been cleaned up while the node has not yet been removed from the backend of the SLB instance. At this time, traffic still flows in from the SLB instance, but downtime occurs because the node has no corresponding iptables or IPVS rules, as shown in Figure 5. The outcome depends on the service mode:
- Cluster Mode: In the Cluster mode, kube-proxy writes records of all running pods to iptables or IPVS of the node. If this node does not have any running pods, the request will be forwarded to another node, so downtime will not occur, as shown in Figure 6.
- Local Mode: In the Local mode, kube-proxy writes records of only the pods on the node to iptables or IPVS. When the node has only one pod and the pod status has changed to Terminating, the record of the pod will be removed from iptables or IPVS. At this time, when the request is forwarded to this node, there are not any corresponding pod records in iptables or IPVS. This results in a request failure. You can avoid the problem by using an in-place upgrade. You must ensure that the node has at least one running pod during the upgrade. This method ensures that at least one running pod is always recorded in iptables or IPVS of the node, so downtime will not occur, as shown in Figure 7.
- ENI Mode: The Elastic Network Interface (ENI) mode bypasses kube-proxy and mounts the pod directly to the backend of the SLB instance. Therefore, the absence of records of a running pod in iptables or IPVS rules of the node does not cause downtime.
Reason for Downtime: After detecting the change in the endpoints, ACK removes the node from the backend of the SLB instance. When the node is removed, the SLB instance directly terminates the persistent connections that are still open on the node. This results in downtime.
Solution: Configure graceful persistent connection termination for the SLB instance (the configuration depends on the cloud vendor).
Methods to Avoid Downtime
To avoid downtime, we can start with pod and service resources. Next, we will introduce the configuration methods corresponding to the preceding reasons for downtime.
Pod configuration example (excerpt): a container named nginx, configured with a liveness probe, a readiness probe, and graceful termination.
Note: You must properly set the probe frequency, delay time, unhealthiness threshold, and other parameters for readinessProbe. The startup time for some applications is long. If the set time is too short, the pod will be restarted repeatedly.
livenessProbe represents a probe for liveness. If the number of failures reaches failureThreshold, the container is restarted. For more information about specific configurations, see the official documentation.
readinessProbe represents a probe for readiness. The pod is added to endpoints only after it passes the readiness probe. After detecting the change in the endpoints, ACK mounts the node to the backend of the SLB instance.
- We recommend that you set the execution time of preStop to the time required for the running pod to process all remaining requests, and set the terminationGracePeriodSeconds value to at least 30 seconds higher than the execution time of preStop.
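The pod-level settings discussed here can be sketched as a single Deployment. This is an illustrative reconstruction rather than the authors' original manifest: the labels, image, probe paths, and all timing values are assumptions, chosen so that terminationGracePeriodSeconds is 30 seconds higher than the preStop sleep.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2                      # assumed replica count
  selector:
    matchLabels:
      app: nginx                   # assumed label
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # preStop sleep (30s) + 30s margin
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: nginx               # assumed image
        ports:
        - containerPort: 80
        # Liveness probe: the container is restarted after
        # failureThreshold consecutive failures.
        livenessProbe:
          httpGet:
            path: /                # assumed health path
            port: 80
          initialDelaySeconds: 30  # allow for slow application startup
          periodSeconds: 10
          failureThreshold: 3
        # Readiness probe: the pod joins endpoints only after it passes.
        readinessProbe:
          httpGet:
            path: /                # assumed readiness path
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        # Graceful termination: keep serving while backends are updated.
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 30"]
```

For a real application, the delay and threshold values should be tuned to its measured startup time and to how long in-flight requests take to finish.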
Cluster Mode (externalTrafficPolicy: Cluster)
Service configuration example (excerpt): port 80.
ACK will mount all nodes in the cluster to the backend of the SLB instance (except for backend servers configured with BackendLabel), so the SLB instance quota is consumed quickly. SLB limits the number of SLB instances that can be mounted to each Elastic Compute Service (ECS) instance. The default value is 50. When the quota is used up, listeners and SLB instances cannot be created.
In the Cluster mode, if the current node does not have a running pod, a request will be forwarded to another node. NAT is required by cross-node forwarding, so the source IP address may be lost.
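A Service in the Cluster mode might look like the following sketch. The Cluster mode corresponds to the standard Kubernetes field `externalTrafficPolicy: Cluster`; the Service name, selector, and target port are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx                      # assumed Service name
spec:
  type: LoadBalancer
  # Any node can accept traffic and forward it to a pod on another node.
  externalTrafficPolicy: Cluster
  selector:
    app: nginx                     # assumed pod label
  ports:
  - port: 80
    targetPort: 80                 # assumed container port
    protocol: TCP
```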
Local Mode (externalTrafficPolicy: Local)
Service configuration example (excerpt): port 80. Ensure that each node has at least one running pod during the update. To ensure an in-place rolling update, modify UpdateStrategy and use nodeAffinity:
- Set Max Unavailable in UpdateStrategy to 0, so that an old pod is terminated only after a new pod is started.
- Label several particular nodes for scheduling.
- Use nodeAffinity and a number of replicas that is greater than the number of related nodes to ensure that a new pod is built in place.
For example, the nodeAffinity rule uses weight 1 and matches the node label key deploy.
By default, ACK adds only the nodes where the pods of the service are located to the backend of the SLB instance. Therefore, the SLB instance quota is consumed slowly. In the Local mode, a request is forwarded directly to the node where the pod is located, and cross-node forwarding is not involved. Therefore, the source IP address is preserved. In the Local mode, you can avoid downtime by using an in-place upgrade, configured as described above.
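The Local mode setup with an in-place rolling update can be sketched as follows. This is a hedged reconstruction: only port 80, weight 1, and the label key `deploy` come from the article; the names, the label value, the replica count, and the assumption that two nodes carry the label are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx                      # assumed Service name
spec:
  type: LoadBalancer
  # Only nodes that host a pod of this Service receive traffic.
  externalTrafficPolicy: Local
  selector:
    app: nginx                     # assumed pod label
  ports:
  - port: 80
    targetPort: 80                 # assumed container port
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3                      # greater than the number of labeled nodes (assumed: 2)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # terminate an old pod only after a new one is started
      maxSurge: 1                  # must be non-zero when maxUnavailable is 0
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: deploy
                operator: In
                values:
                - nginx            # hypothetical label value
      containers:
      - name: nginx
        image: nginx               # assumed image
        ports:
        - containerPort: 80
```

Because the affinity rule is only a preference, the scheduler may still place pods on other nodes, so it is worth checking that every backend node keeps at least one running pod during the update.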
ENI Mode (Alibaba Cloud-Specific Mode)
Service configuration example (excerpt): a port named http.
In the Terway network mode, you can create an SLB instance in the ENI mode by configuring the service.beta.kubernetes.io/backend-type: "eni" annotation. In the ENI mode, the pod is mounted directly to the backend of the SLB instance and traffic is not processed by kube-proxy, so downtime does not occur. A request is forwarded directly to the pod, so the source IP address is preserved.
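A Service using the ENI mode could be sketched as follows. The annotation and the port name `http` come from the article; the Service name, selector, and port numbers are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx                      # assumed Service name
  annotations:
    # Mount pods (through their ENIs) directly to the SLB backend.
    service.beta.kubernetes.io/backend-type: "eni"
spec:
  type: LoadBalancer
  selector:
    app: nginx                     # assumed pod label
  ports:
  - name: http
    port: 80                       # assumed port
    targetPort: 80                 # assumed container port
    protocol: TCP
```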
The following table shows the comparison among the three service modes.
In the Terway network mode, you can choose the combination of the service in the ENI mode, graceful pod termination, and readinessProbe.
- If the number of SLB instances in the cluster is not large and the source IP address does not need to be preserved, you can choose the combination of the Cluster mode, graceful pod termination, and readinessProbe.
- If the number of SLB instances in the cluster is large or the source IP address needs to be preserved, you can choose the combination of the Local mode, graceful pod termination, readinessProbe, and in-place upgrade. Make sure that each node has a running pod during the upgrade.