Getting Started with Kubernetes | Stateful Application Orchestration with StatefulSets

By Wang Siyu (Jiuzhu), Technical Expert at Alibaba

For application O&M engineers, deploying and delivering stateful applications is no easy task. Common stateful operations include configuring disk persistence, assigning independent and stable network identifiers to all machines, and defining the publishing order. Kubernetes provides StatefulSets, a type of controller or workload used to deploy and run stateful applications in a Kubernetes environment.

1. Requirements of Stateful Applications

Kubernetes provides Deployments for managing application orchestration.

A Deployment provides the following functions:

  • It allows you to define the desired number of pods. A Deployment controller ensures that the desired number of pods is maintained for the target version.

Simply put, a Deployment manages pods of the same version as identical replicas. Deployment controllers process pods of the same version in the same way, no matter what applications are deployed or what pod behaviors are configured.

This capability can meet the requirements of stateless applications, but what if you run stateful applications?

Requirements Analysis

Let’s take a look at the following requirements:

The preceding requirements cannot be met through Deployments. Kubernetes provides StatefulSets for managing stateful applications.

StatefulSet: a Controller for Stateful Application Management

Many stateless applications in the community are also managed through StatefulSets. This article will explore why.

As shown in the right part of the preceding figure, each pod in a StatefulSet has an ordinal, which ranges from 0 to the defined number of replicas minus 1. Each pod has an independent network identifier (that is, a hostname), an independent Persistent Volume Claim (PVC), and one or more Persistent Volumes (PVs). As a result, every pod in the same StatefulSet has a unique hostname and an exclusive storage disk, which meets the needs of many stateful applications.

As shown in the right part of the preceding figure:

  • Each pod has an ordinal, which determines the specific order in which the pods are created, deleted, and upgraded.

2. Case Study

Create a StatefulSet

As shown in the left part of the preceding figure, a headless service is configured to assign an independent hostname to each pod in a StatefulSet. The service name is nginx.

In the right part, a StatefulSet is configured, where serviceName is set to nginx under spec. serviceName indicates the service that matches the StatefulSet.

spec contains other fields, such as selector and template. The selector field indicates a label selector. The label selection logic defined by selector must match app: nginx in template.metadata.labels. The template field defines an NGINX container of the Alpine image version, with Port 80 exposed as a web service.

template.spec defines volumeMounts. The mounted volume comes from volumeClaimTemplates, which is the PVC template. The PVC template defines a PVC called www-storage. volumeMounts.name is set to this PVC name, and the volume is mounted to the /usr/share/nginx/html directory. In this way, each pod has an independent PVC that is mounted to the corresponding directory in the container.
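
Because the manifests themselves appear only in the figure, the following is a minimal sketch of what the two objects described above might look like. It builds them as Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. The names, labels, port, replica count, and mount path come from this article; the image tag nginx:alpine, the ReadWriteOnce access mode, and the 1Gi storage request are assumptions made for illustration.

```python
import json

APP_LABELS = {"app": "nginx"}

# Headless service: clusterIP "None" means no virtual IP is allocated,
# so each pod behind the service gets its own DNS record.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "nginx", "labels": APP_LABELS},
    "spec": {
        "clusterIP": "None",
        "selector": APP_LABELS,
        "ports": [{"name": "web", "port": 80}],
    },
}

statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "nginx-web"},
    "spec": {
        "serviceName": "nginx",  # must name the headless service above
        "replicas": 3,
        "selector": {"matchLabels": APP_LABELS},
        "template": {
            "metadata": {"labels": APP_LABELS},
            "spec": {
                "containers": [{
                    "name": "nginx",
                    "image": "nginx:alpine",  # assumed tag
                    "ports": [{"name": "web", "containerPort": 80}],
                    "volumeMounts": [{
                        "name": "www-storage",  # refers to the claim template
                        "mountPath": "/usr/share/nginx/html",
                    }],
                }],
            },
        },
        "volumeClaimTemplates": [{
            "metadata": {"name": "www-storage"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],                # assumed
                "resources": {"requests": {"storage": "1Gi"}},  # assumed
            },
        }],
    },
}


def selector_matches_template(sts):
    """Check that every selector label is present on the pod template."""
    wanted = sts["spec"]["selector"]["matchLabels"]
    labels = sts["spec"]["template"]["metadata"]["labels"]
    return all(labels.get(key) == value for key, value in wanted.items())


manifest = {"apiVersion": "v1", "kind": "List", "items": [service, statefulset]}
print(json.dumps(manifest, indent=2))
```

The helper selector_matches_template mirrors the rule stated above: the labels chosen by selector must be present in template.metadata.labels.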

Statuses of Services and StatefulSets

After you create a headless service and a StatefulSet, run the get command to verify that the NGINX service has been created.

The get endpoints nginx command output shows that the NGINX service registers three IP addresses and a port. The IP addresses map to the pod IP addresses, and the port maps to Port 80 configured under spec.

The get sts nginx-web command output includes the READY column with the value 3/3. The first 3 is the number of pods in the StatefulSet that are currently in the Ready state, and the second 3 is the desired number of pods in the StatefulSet.

Statuses of Pods and PVCs

The get pod command output shows that all three pods are in the Running state and are ready. The pod IP addresses map to the IP addresses in the get endpoints nginx command output.

In the get pvc command output, the NAME column includes names that consist of three elements: www-storage, nginx-web, and a number as the suffix. www-storage is defined by volumeClaimTemplates. nginx-web is the name of the StatefulSet. The number is the ordinal of the pod. Each PVC is used by exactly one of the three pods, and each PVC is bound to its own PV, so different pods use different PVs.
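
The naming scheme can be sketched in a few lines of Python. The pod and PVC name patterns follow the description above; the per-pod DNS name follows the usual headless-service pattern, and the namespace default and cluster domain cluster.local used here are assumptions that may differ in your cluster.

```python
def pod_name(sts_name, ordinal):
    """Pods are named <statefulset name>-<ordinal>."""
    return f"{sts_name}-{ordinal}"


def pvc_name(claim_template, sts_name, ordinal):
    """PVCs are named <claim template name>-<pod name>."""
    return f"{claim_template}-{pod_name(sts_name, ordinal)}"


def pod_dns(sts_name, ordinal, service,
            namespace="default", cluster_domain="cluster.local"):
    """Stable DNS name of a pod behind a headless service.

    namespace and cluster_domain are assumed defaults, not values
    taken from this article.
    """
    return f"{pod_name(sts_name, ordinal)}.{service}.{namespace}.svc.{cluster_domain}"


for ordinal in range(3):
    print(pod_name("nginx-web", ordinal),
          pvc_name("www-storage", "nginx-web", ordinal),
          pod_dns("nginx-web", ordinal, "nginx"))
```

Because both names are derived only from the StatefulSet name, the template name, and the ordinal, a pod that is recreated with the same ordinal resolves to the same hostname and the same PVC.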

Pod Versions

Deployments use ReplicaSets to manage pod versions and keep the desired number of pods, whereas a StatefulSet is a controller that manages its pods directly. A StatefulSet identifies the version of each managed pod through the pod label controller-revision-hash. This label is similar to the pod-template-hash label that is injected into a pod by a Deployment or ReplicaSet.

As shown in the preceding figure, the get pod command output includes controller-revision-hash, where hash indicates the template version of the pod when created for the first time. It ends with 677759c9b8. Note down the value of controller-revision-hash to check whether it is changed when pods are upgraded later.

Update Images

As shown in the preceding figure, the command applies the StatefulSet configuration where the image version of the StatefulSet is upgraded to mainline.

View the Status of the New Version

Run the get pod command to query the revision hash. The command output shows that the controller-revision-hash values of the three pods have changed to the new revision hash, which ends with 7c55499668. The pod creation times show that Pod 2 was recreated first, followed by Pod 1 and Pod 0, which means that the pods were upgraded in the following order: Pod 2 -> Pod 1 -> Pod 0. The PVCs used by the pods remain unchanged after the upgrade, so the data stored in the PVs before the upgrade is mounted to the upgraded pods.

As shown in the right part of the preceding figure, there are several key fields in the status part of the StatefulSet:

  • currentReplicas: the number of replicas in the current version.

All pods are upgraded to the target version if currentReplicas is the same as updatedReplicas and currentRevision is the same as updateRevision.
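
As a minimal sketch, this check can be written as a function over the status section of a StatefulSet. In the apps/v1 API the replica counter is spelled updatedReplicas. The two sample statuses are hypothetical; their revision names assume the pattern nginx-web-<hash> with the hashes quoted in this article.

```python
def rollout_complete(status):
    """Return True when the status says every pod runs the target revision."""
    required = ("currentReplicas", "updatedReplicas",
                "currentRevision", "updateRevision")
    # A status that lacks any of the fields is treated as "not finished".
    if any(key not in status for key in required):
        return False
    return (status["currentReplicas"] == status["updatedReplicas"]
            and status["currentRevision"] == status["updateRevision"])


# Hypothetical status snapshots, for illustration only.
in_progress = {
    "replicas": 3,
    "currentReplicas": 2,
    "updatedReplicas": 1,
    "currentRevision": "nginx-web-677759c9b8",
    "updateRevision": "nginx-web-7c55499668",
}
finished = {
    "replicas": 3,
    "currentReplicas": 3,
    "updatedReplicas": 3,
    "currentRevision": "nginx-web-7c55499668",
    "updateRevision": "nginx-web-7c55499668",
}

print(rollout_complete(in_progress), rollout_complete(finished))
```

In practice the dictionary would be the status field parsed from the output of kubectl get sts nginx-web -o json.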

3. Practice

StatefulSet Orchestration File

Assume that you have connected to an Alibaba Cloud cluster, which has three nodes.

The following shows how to create a StatefulSet and a service. Let’s look at the orchestration file.

As shown in the preceding figure, in the service configuration, Port 80 is exposed for NGINX. In the StatefulSet configuration, metadata.name is set to nginx-web, image information is defined in template.containers, and a volumeClaimTemplates is defined as a PVC template.

Start to Create

Run the command shown in the preceding figure to create a service and a StatefulSet. Run the get pod command to verify that Pod 0 was created first. Run the get pvc command to verify that PVC 0 is bound to a PV.

The preceding command output shows that Pod 0 is being created and is in the ContainerCreating state.

After Pod 0 is created, Pod 1 and Pod 2 are created in sequence, and related PVCs are also created.

A PVC is created before each pod is created. After the PVC is created, it transitions from the Pending state to the Bound state, indicating that the PVC is bound to a PV. The pod then enters the ContainerCreating state and finally reaches the Running state.

View the Status

Run the kubectl get sts nginx-web -o yaml command to view the StatefulSet status.

As shown in the preceding command output, the desired number of replicas is 3, the number of available replicas is also 3, and the current version is the latest.

Port 80 is exposed for the NGINX service, which has three IP addresses.

Run the get pod command to verify that the IP addresses of the three pods map to the IP addresses under ENDPOINTS.

In short, the three PVCs and three pods reach the desired status. In the status data of the StatefulSet, the values of currentReplicas and readyReplicas are both 3.

Upgrade

kubectl set image is the fixed command prefix for changing an image. StatefulSet indicates the resource type. nginx-web indicates the resource name. In nginx=nginx:mainline, the nginx before the equal sign is the container name defined under template, and nginx:mainline is the target image version.

By running the preceding command, you can upgrade the image in the StatefulSet to the target version.
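
Since the command itself appears only in the figure, here is a small sketch that assembles it from the four parts described above. The helper set_image_command is hypothetical and exists only for illustration; it builds an argument list and does not contact a cluster.

```python
import shlex


def set_image_command(resource_type, resource_name, container, image):
    """Build the argument list for `kubectl set image TYPE NAME CONTAINER=IMAGE`."""
    return ["kubectl", "set", "image",
            resource_type, resource_name, f"{container}={image}"]


cmd = set_image_command("statefulset", "nginx-web", "nginx", "nginx:mainline")
print(shlex.join(cmd))
```

To actually run it against a cluster, the list could be passed to subprocess.run(cmd, check=True), assuming kubectl is installed and configured.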

Run the get pod command to view the pod status. nginx-web-1 and nginx-web-2 are in the Running state. Their controller-revision-hash indicates the latest version. The original nginx-web-0 pod has been deleted, and the new pod is being created.

Check the pod status again, and you can see that all pods are in the Running state.

View the StatefulSet information. In the status part, currentRevision shows the latest version, indicating that the three pods in the StatefulSet are of the latest version.

How do we determine whether the three pods still use their original hostnames and PVCs after the upgrade?

The hostnames configured for the headless service are associated only with pod names and, therefore, can be reused by the upgraded pods as long as the pod names remain unchanged after the upgrade.

The preceding command output shows that the PVC creation time is still the same as the time when the pod was created for the first time, indicating that the upgraded pods still use their original PVCs.

For example, by viewing the details of a pod, you can see that, in the pod’s volumes declaration, the name www-storage-nginx-web-0 under persistentVolumeClaim matches the PVC with the ordinal 0, which is the PVC used by the old pod. When a pod is upgraded, the StatefulSet controller deletes the old pod and creates a pod with the same name as the old pod, so the new pod reuses the PVC of the old pod.

This enables pods to reuse their network identifiers and storage after upgrades.

4. Architecture Design

Management Mode

A StatefulSet supports the creation of three types of resources:

  • ControllerRevision

ControllerRevision allows the StatefulSet to manage different template versions.

For example, a ControllerRevision is created for the first template version of the NGINX service when this service is created. After the image version is modified, the StatefulSet controller creates another ControllerRevision. In other words, each ControllerRevision maps to one template version and to that version's revision hash. The value of the pod label controller-revision-hash is the name of the ControllerRevision that the pod was created from.

  • PVC

If you define volumeClaimTemplates (the PVC template) under spec, then the StatefulSet creates a PVC based on this template before creating a pod and adds the PVC to the pod volume. If you do not define a PVC template under spec, then no independent PV is mounted to the created pod.

  • Pod

A StatefulSet creates, deletes, and upgrades pods in order. Each pod has a unique ordinal.

As shown in the preceding figure, the StatefulSet controller owns three types of resources: ControllerRevision, pod, and PVC.

In the current version, the StatefulSet adds OwnerReferences only to ControllerRevisions and pods, but not to PVCs. When a resource with OwnerReferences is deleted, its associated resources are also deleted by default. Therefore, after a StatefulSet is deleted, the ControllerRevisions and pods created by the StatefulSet are also deleted. However, the created PVCs are not deleted because they do not have OwnerReferences.
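
The effect of owner references on deletion can be illustrated with a toy model. This is a deliberately simplified sketch: real garbage collection matches owners by UID and runs asynchronously, whereas the Resource class and cascade_delete function below are hypothetical and match owners by name.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    kind: str
    name: str
    owners: list = field(default_factory=list)  # names of owning objects


def cascade_delete(resources, owner_name):
    """Return the resources that survive after `owner_name` is deleted."""
    return [r for r in resources
            if r.name != owner_name and owner_name not in r.owners]


resources = [Resource("StatefulSet", "nginx-web")]
resources.append(
    Resource("ControllerRevision", "nginx-web-677759c9b8", ["nginx-web"]))
for ordinal in range(3):
    # Pods carry an owner reference to the StatefulSet; PVCs do not.
    resources.append(Resource("Pod", f"nginx-web-{ordinal}", ["nginx-web"]))
    resources.append(
        Resource("PersistentVolumeClaim", f"www-storage-nginx-web-{ordinal}"))

survivors = cascade_delete(resources, "nginx-web")
print([r.name for r in survivors])
```

Only the three PVCs remain in survivors, which matches the behavior described above: the data outlives the StatefulSet unless the PVCs are deleted explicitly.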

StatefulSet Controller

The preceding figure shows the workflow of a StatefulSet controller.

The StatefulSet controller first registers the event handlers of the informer to process changes of the StatefulSet and its pods. In the controller logic, upon receiving a change of the StatefulSet or a pod, the StatefulSet controller queues up the StatefulSet. Then, the StatefulSet controller dequeues the StatefulSet and performs the Update Revision operation. That is, the StatefulSet controller checks whether the template of the StatefulSet has a ControllerRevision. If no ControllerRevision is available, the template has been updated. In this case, the StatefulSet controller creates a revision, resulting in a new version of ControllerRevision hash.

Then, the StatefulSet controller fetches all the pods of the StatefulSet and sorts them by ordinal. In the Manage pods in order process, the controller compares these pods with the desired replicas: if any pods are missing, they are created in the order of their ordinals, and if any pods are redundant, they are deleted in the order of their ordinals. When the number of pods and the pod ordinals are consistent with the desired replicas, the controller moves on to the Update in order process, in which it checks whether the pods are of the desired version. If not, the StatefulSet controller upgrades the pods in the order of their ordinals.

The Update in order process is essentially a process of deleting pods. After a pod is deleted, the StatefulSet controller will find that this pod is missing based on the acquired StatefulSet. Then, the StatefulSet controller creates a pod during the Manage Pods in order process. After that, the StatefulSet controller updates the status. The resulting status can be displayed by running the command described in the preceding sections.

Through this workflow, the StatefulSet controller can manage stateful applications.
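
The workflow can be approximated with a toy reconcile loop. This is a sketch under strong assumptions, not the controller's actual code: every pod is treated as Ready immediately, every new pod is created from the target revision, and each pass performs a single action. The names reconcile_step and reconcile are hypothetical.

```python
def reconcile_step(pods, replicas, update_revision):
    """Perform one simplified reconcile pass.

    pods maps ordinal -> revision and is modified in place.
    Returns a description of the action taken, or None when converged.
    """
    # Manage pods in order: create the lowest missing ordinal first.
    for ordinal in range(replicas):
        if ordinal not in pods:
            pods[ordinal] = update_revision
            return f"create pod-{ordinal} at {update_revision}"
    # Delete redundant pods, greatest ordinal first.
    redundant = [ordinal for ordinal in pods if ordinal >= replicas]
    if redundant:
        victim = max(redundant)
        del pods[victim]
        return f"delete pod-{victim}"
    # Update in order: delete the outdated pod with the greatest ordinal.
    # The next pass sees it as missing and recreates it.
    for ordinal in sorted(pods, reverse=True):
        if pods[ordinal] != update_revision:
            del pods[ordinal]
            return f"delete pod-{ordinal} (outdated)"
    return None


def reconcile(pods, replicas, update_revision):
    """Repeat reconcile passes until nothing is left to do."""
    actions = []
    while (action := reconcile_step(pods, replicas, update_revision)) is not None:
        actions.append(action)
    return actions


pods = {0: "rev1", 1: "rev1", 2: "rev1"}
for line in reconcile(pods, replicas=3, update_revision="rev2"):
    print(line)
```

For the upgrade shown here, the loop alternates between deleting an outdated pod and recreating it, starting with pod-2 and ending with pod-0. The same function also covers scaling: with only pod-0 present and replicas=3 it creates pod-1 and then pod-2.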

Scale-up Simulation

Assume that the initial configuration of a StatefulSet is as follows: The number of replicas is 1 and one pod, Pod 0, is managed. If you change the number of replicas from 1 to 3, Pod 1 is created first, and Pod 2 is created when Pod 1 is in the Ready state.

As shown in the preceding figure, pods in the StatefulSet are created and numbered from 0. The ordinals of the pods in a StatefulSet with N replicas fall in the range [0, N). When N is greater than 0, the ordinals of pods range from 0 to N - 1.

Scaling Management Policies

If you do not want pods to be created or deleted in the order of their ordinals, a StatefulSet lets you choose a different policy. This is why some people in the community also use StatefulSets to manage stateless applications: a StatefulSet assigns a unique network identifier and independent network storage to each of its managed pods and can still scale pods concurrently.

StatefulSet.spec includes the podManagementPolicy field, which can be set to OrderedReady (default value) or Parallel.

If podManagementPolicy is unspecified, the StatefulSet controller uses OrderedReady by default, and pods are scaled in order. This means a pod is scaled only after the previous pods are in the Ready state. When pods are scaled down, they are deleted in the descending order of their ordinals.

For example, in the preceding figure, when the StatefulSet is scaled from Pod 0 to Pod 0, Pod 1, and Pod 2, Pod 1 is created first and then Pod 2 is created only when Pod 1 is in the Ready state. If Pod 0 changes to the Not Ready state due to a host or application error when Pod 1 is being created, then the StatefulSet controller will not create Pod 2. This means that a pod is created only when all the previous pods are in the Ready state. In this example, the StatefulSet can create Pod 2 only after Pod 0 and Pod 1 are all in the Ready state.

If podManagementPolicy is set to Parallel, pods are scaled in parallel, without having to wait until all preceding pods are ready or pods with greater ordinals are deleted.
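
The difference between the two policies during a scale-up can be sketched as follows. The function pods_to_create is a hypothetical helper that only models the readiness rule described above; it ignores deletion and termination handling.

```python
def pods_to_create(existing, ready, replicas, policy="OrderedReady"):
    """Return the ordinals that may be created right now.

    existing: set of ordinals whose pods already exist
    ready:    set of ordinals whose pods are in the Ready state
    """
    missing = [o for o in range(replicas) if o not in existing]
    if policy == "Parallel":
        # No waiting: all missing pods can be created at once.
        return missing
    if policy != "OrderedReady":
        raise ValueError(f"unknown podManagementPolicy: {policy}")
    if not missing:
        return []
    candidate = missing[0]
    # OrderedReady: every pod with a smaller ordinal must be Ready.
    if all(o in ready for o in range(candidate)):
        return [candidate]
    return []


# Pod 0 exists and is Ready: only Pod 1 may be created.
print(pods_to_create({0}, {0}, 3))
# Pod 0 became Not Ready while Pod 1 was starting: Pod 2 is held back.
print(pods_to_create({0, 1}, {1}, 3))
# With Parallel, Pod 1 and Pod 2 are created together.
print(pods_to_create({0}, {0}, 3, policy="Parallel"))
```

The second call reproduces the example above: Pod 2 is not created until both Pod 0 and Pod 1 are Ready.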

Publishing Simulation

Assume that StatefulSet Template 1 maps to Revision 1 and that the three pods in the StatefulSet are all of Revision 1. After you modify the template, for example by changing the image, the StatefulSet controller first creates Revision 2, which maps to the ControllerRevision 2 resource, whose name is used as the new revision hash. The controller then upgrades the pods one by one in descending order of their ordinals: after Pod 2 is upgraded to the new version, the controller deletes and recreates Pod 1, and then does the same for Pod 0.

The logic here is simple. In the upgrade process, the StatefulSet controller deletes the old-version pod with the greatest ordinal. During the next reconcile cycle, the StatefulSet controller finds that this pod is missing and creates a pod of the new version in its place.

Spec Fields

Let’s take a look at the following spec fields:

  • replicas: the desired number of replicas.
  • serviceName: the name of a headless service. If you do not want to configure a headless service, you can set this field to a nonexistent value, which is not checked by the StatefulSet controller. We recommend that you configure a headless service regardless of whether or not the pods in a StatefulSet require hostnames.

updateStrategy Field

As shown in the right part of the preceding figure, the update strategy type (StatefulSetUpdateStrategyType) can be set to RollingUpdate or OnDelete.

  • RollingUpdate: Pods are upgraded in rolling upgrade mode, which is similar to the upgrade method used by Deployments.

RollingUpdateStatefulSetStrategy has the Partition field, which indicates the number of pods that keep the old version during the rolling upgrade. Note that this is not the number of pods that are upgraded to the new version during the grayscale upgrade.

For example, assume that a StatefulSet has 10 replicas and that the Partition field is set to 8. In this case, eight pods keep the old version, and the other two pods are upgraded to the new version during the grayscale upgrade. When there are 10 replicas, the pod ordinals fall in the range [0, 10). When Partition is set to 8, the eight pods in the ordinal range [0, 8) keep the old version, and the two pods in the ordinal range [8, 10) are upgraded to the new version.

Assume replicas = N and Partition = M (M < N). The pods that keep the old version fall in the ordinal range [0, M), and the pods upgraded to the new version fall in the ordinal range [M, N). The Partition field can be used for the grayscale upgrade, which is currently not supported by Deployments.
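
The two half-open ranges translate directly into code. The function split_by_partition below is a hypothetical helper for illustration; it caps the partition at the replica count, so a partition larger than the number of replicas simply means that no pod is upgraded.

```python
def split_by_partition(replicas, partition):
    """Split pod ordinals into (kept on old version, upgraded to new version).

    Ordinals in [0, partition) keep the old version;
    ordinals in [partition, replicas) are upgraded.
    """
    if replicas < 0 or partition < 0:
        raise ValueError("replicas and partition must be non-negative")
    boundary = min(partition, replicas)
    old = list(range(0, boundary))
    new = list(range(boundary, replicas))
    return old, new


old, new = split_by_partition(replicas=10, partition=8)
print("keep old version:", old)
print("upgrade to new version:", new)
```

With replicas = 10 and Partition = 8, ordinals 0 through 7 stay on the old version and ordinals 8 and 9 are upgraded. Lowering the partition step by step then widens the rollout, and setting it to 0 upgrades every pod.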

5. Summary

Let’s summarize what we have learned in this article:

  • StatefulSets are a common type of workload in Kubernetes intended for stateful application deployment, but they can also be used to deploy stateless applications.
