Enabling Rolling Updates for Applications in Kubernetes with Zero Downtime

Nov 2, 2020

By Zibai, an Alibaba Cloud Development Engineer and Xiheng, an Alibaba Cloud Technical Expert

In Kubernetes clusters, business applications usually provide external services through a Deployment combined with a LoadBalancer Service. Figure 1 shows a typical deployment architecture. This architecture is simple and convenient for deployment and O&M, but downtime may occur during application updates or upgrades and cause online problems. Today, let’s take a close look at why this architecture may cause downtime during application updates and how to prevent it.

Reasons for Downtime

During a rolling update of a Deployment, a new pod is created first. The old pod is deleted only after the new pod starts to run.

Create Pods

Reason for Downtime: After the pod starts to run, it is added to endpoints. After detecting the change in the endpoints, Container Service for Kubernetes (ACK) adds the corresponding node to the backend of the Server Load Balancer (SLB) instance. At this time, the request is forwarded from the SLB instance to the pod. However, the application code in the pod has not yet been initialized, so the pod cannot handle the request. This results in downtime, as shown in Figure 2.

Solution: Configure readinessProbe for the pod, and add the corresponding node to the backend of the SLB instance after the application code in the pod is initialized.
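As a minimal sketch of this solution, the fragment below adds an HTTP readiness probe to a container. The /healthz path, the port, and the timing values are assumptions for illustration only; use an endpoint that your application answers successfully only after it has finished initializing.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    readinessProbe:
      httpGet:
        path: /healthz        # hypothetical endpoint, not part of the stock nginx image
        port: 80              # assumed application port
      initialDelaySeconds: 5  # wait before the first probe
      periodSeconds: 5        # probe interval
      failureThreshold: 3     # consecutive failures before the pod is marked not ready
```

Until the probe succeeds, the pod is not added to endpoints, so no traffic is routed to it.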

Delete Pods

When you delete the old pod, you must synchronize the status of multiple objects, such as the endpoint, IPVS, iptables, and SLB instance. These sync operations are performed asynchronously. Figure 3 shows the overall synchronization process.

*Figure 3: Sequence Diagram for Deployment Update*

Pods

1. Change the Pod Status: Set the pod status to Terminating and remove it from the endpoints of all services. At this time, the pod stops receiving new traffic, but the containers running within the pod are not affected.
2. Execute the preStop Hook: The preStop hook is triggered when the pod is deleted. preStop supports exec commands (for example, shell scripts) and HTTP requests.
3. Send the SIGTERM Signal: Send the SIGTERM signal to containers in the pod.
4. Specify the Wait Time: The terminationGracePeriodSeconds field controls the wait time. The default value is 30 seconds. This wait time starts counting at the same time as the preStop hook, so the terminationGracePeriodSeconds value must be higher than the execution time of preStop. Otherwise, the pod may be killed before preStop is completely executed.
5. Send the SIGKILL Signal: After the specified wait time, send the SIGKILL signal to containers in the pod to delete the pod.

Reason for Downtime: The preceding steps 1, 2, 3, and 4 are performed simultaneously, so it is possible that the pod has not been removed from endpoints after it received the SIGTERM signal and stopped working. At this time, the request is forwarded from the SLB instance to the pod, but the pod has already stopped working. Therefore, downtime may occur, as shown in Figure 4.

Solution: Configure the preStop hook for the pod, so that the pod sleeps for some time before SIGTERM is sent to its containers, instead of stopping immediately. This ensures that the traffic still forwarded from the SLB instance during this time can be processed by the pod.
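The timing relationship can be sketched as follows. The 15-second sleep and the 45-second grace period are assumed values chosen for illustration, not recommendations from this article. The example also assumes that the container image provides /bin/sh and sleep.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  # Total budget for preStop plus application shutdown.
  # It must be larger than the preStop sleep, or SIGKILL arrives too early.
  terminationGracePeriodSeconds: 45
  containers:
  - name: nginx
    image: nginx
    lifecycle:
      preStop:
        exec:
          # The kubelet sends SIGTERM only after this command returns, so the
          # container keeps serving while endpoints, kube-proxy rules, and the
          # SLB backend are being updated.
          command: ["/bin/sh", "-c", "sleep 15"]
```

With these values, the application has roughly 30 seconds left to shut down after SIGTERM before SIGKILL is sent.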

iptables or IPVS

Reason for Downtime: When the pod status changes to Terminating, the pod is removed from the endpoints of all services, and kube-proxy cleans up the corresponding iptables or IPVS rules. After detecting the change in the endpoints, ACK calls the SLB OpenAPI (slbopenapi) to remove the node from the backend of the SLB instance, which takes several seconds. The two operations are performed simultaneously, so the iptables or IPVS rules on the node may already be cleaned up while the node is still in the backend of the SLB instance. At this time, traffic still flows from the SLB instance to the node, but the node has no corresponding iptables or IPVS rules, so downtime occurs, as shown in Figure 5.

Solution:

Cluster Mode: In the Cluster mode, kube-proxy writes records of all running pods to iptables or IPVS of the node. If this node does not have any running pods, the request will be forwarded to another node, so downtime will not occur, as shown in Figure 6.

*Figure 6: Request Forwarding in Cluster Mode*

Local Mode: In the Local mode, kube-proxy writes records of only the pods on the node to iptables or IPVS. When the node has only one pod and the pod status has changed to Terminating, the record of the pod will be removed from iptables or IPVS. At this time, when the request is forwarded to this node, there are not any corresponding pod records in iptables or IPVS. This results in a request failure. You can avoid the problem by using an in-place upgrade. You must ensure that the node has at least one running pod during the upgrade. This method ensures that at least one running pod is always recorded in iptables or IPVS of the node, so downtime will not occur, as shown in Figure 7.

*Figure 7: Request Forwarding During In-Place Upgrade in the Local Mode*

ENI Mode: The Elastic Network Interface (ENI) mode bypasses kube-proxy and mounts the pod directly to the backend of the SLB instance. Therefore, the absence of records of a running pod in iptables or IPVS rules of the node does not cause downtime.

*Figure 8: Request Forwarding in ENI Mode*

SLB

Reason for Downtime: After detecting the change in the endpoints, ACK removes the node from the backend of the SLB instance. When the node is removed, the SLB instance immediately terminates the persistent connections that are still active on the node. This results in downtime.

Solution: Configure graceful termination of persistent connections for the SLB instance. The configuration method depends on the cloud vendor.
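On ACK, this is typically done through Service annotations. The sketch below uses the connection draining annotations as we understand them from the Alibaba Cloud cloud controller manager documentation. Treat the annotation names and the 30-second timeout as assumptions and verify them against the version you run. Other cloud vendors use different mechanisms.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
  annotations:
    # Assumed annotation names; check the ACK documentation for your version.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain: "on"
    # Seconds to keep existing connections open after a backend is removed (illustrative value).
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain-timeout: "30"
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer
```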

Methods to Avoid Downtime

To avoid downtime, we can start with pod and service resources. Next, we will introduce the configuration methods corresponding to the preceding reasons for downtime.

Pod Configuration

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
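    # Liveness probe: the container is restarted after failureThreshold consecutive failures
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 30
      successThreshold: 1
      tcpSocket:
        port: 80  # must be the port the application listens on (the nginx image listens on 80)
      timeoutSeconds: 1
    # Readiness probe: the pod is added to endpoints only after this probe succeeds
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 30
      successThreshold: 1
      tcpSocket:
        port: 80  # must be the port the application listens on
      timeoutSeconds: 1
    # Graceful termination: sleep before SIGTERM is sent, so remaining traffic can still be served
    lifecycle:
      preStop:
        exec:
          command:
          - sleep
          - "30"  # command arguments must be strings, so the number is quoted
  # Must be higher than the execution time of preStop
  terminationGracePeriodSeconds: 60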

Note: You must properly set the probe frequency, delay time, unhealthiness threshold, and other parameters for livenessProbe and readinessProbe. Some applications take a long time to start. If the delay or threshold of the liveness probe is too small, the container is restarted repeatedly before it finishes starting.

livenessProbe represents a probe for liveness. If the number of consecutive failures reaches failureThreshold, the container is restarted. For more information about specific configurations, see the official documentation.
readinessProbe represents a probe for readiness. Only after the readiness probe is passed can the pod be added to endpoints. After detecting the change in the endpoints, ACK will mount the node to the backend of the SLB instance.
We recommend that you set the execution time of preStop to the time required for the running pod to process all remaining requests and set the terminationGracePeriodSeconds value to at least 30 seconds higher than the execution time of preStop.

Service Configuration

Cluster Mode (`externalTrafficPolicy`: Cluster)

apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  externalTrafficPolicy: Cluster
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer

ACK mounts all nodes in the cluster to the backend of the SLB instance (except for backend servers configured with BackendLabel), so the SLB instance quota is consumed quickly. SLB limits the number of SLB instances to which each Elastic Compute Service (ECS) instance can be mounted. The default value is 50. When the quota is used up, new listeners and SLB instances cannot be created.

In the Cluster mode, if the current node does not have a running pod, a request will be forwarded to another node. NAT is required by cross-node forwarding, so the source IP address may be lost.

Local Mode (`externalTrafficPolicy`: Local)

apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  externalTrafficPolicy: Local
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer
# Ensure that each node has at least one running pod during the update.
# To get an in-place rolling update, modify the update strategy and use nodeAffinity:
# * Set maxUnavailable in the update strategy to 0, so that an old pod is terminated only after a new pod has started.
# * Label the particular nodes that are used for scheduling.
# * Use nodeAffinity and a number of replicas that is greater than the number of labeled nodes, so that new pods are created on the same nodes.
# For example:
---
apiVersion: apps/v1
kind: Deployment
......
strategy:
  rollingUpdate:
    maxSurge: 50%
    maxUnavailable: 0%
  type: RollingUpdate
......
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: deploy
              operator: In
              values:
              - nginx

By default, ACK adds only the nodes where the pods of the service are located to the backend of the SLB instance, so the SLB instance quota is consumed slowly. In the Local mode, a request is forwarded directly to the node where the pod is located, and cross-node forwarding is not involved, so the source IP address is preserved. In the Local mode, you can avoid downtime by using an in-place upgrade. The YAML file is shown above.
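For reference, a complete Deployment that follows this approach might look like the sketch below. It assumes that two nodes have been labeled deploy=nginx (for example, with kubectl label node <node-name> deploy=nginx) and that pods carry the run: nginx label used by the Service selector. The replica count is an assumption chosen to exceed the number of labeled nodes. Because the affinity is only preferred, the scheduler treats it as a best-effort hint rather than a guarantee.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 3                # assumed: more replicas than the two labeled nodes
  selector:
    matchLabels:
      run: nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%          # new pods are started first
      maxUnavailable: 0      # an old pod is removed only after a new pod is ready
  template:
    metadata:
      labels:
        run: nginx           # matches the Service selector
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: deploy
                operator: In
                values:
                - nginx
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
```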

ENI Mode (Alibaba Cloud-Specific Mode)

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/backend-type: "eni"
  name: nginx
spec:
  ports:
  - name: http
    port: 30080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer

In the Terway network mode, you can create an SLB instance in the ENI mode by configuring the service.beta.kubernetes.io/backend-type: "eni" annotation. In the ENI mode, the pod is mounted directly to the backend of the SLB instance, and traffic is not processed by kube-proxy, so downtime does not occur. A request is forwarded directly to the pod, so the source IP address can be preserved.

The following table shows the comparison among the three service modes.

Summary

Terway (Recommended)

You can choose the combination of the service in the ENI mode, graceful pod termination, and readinessProbe.

Flannel

If the number of SLB instances in the cluster is not large and the source IP address does not need to be preserved, you can choose the combination of the Cluster mode, graceful pod termination, and readinessProbe.
If the number of SLB instances in the cluster is large or the source IP address needs to be preserved, you can choose the combination of the Local mode, graceful pod termination, readinessProbe, and in-place upgrade. Make sure that each node has a running pod during the upgrade.