Getting Started with Kubernetes | Stateful Application Orchestration with StatefulSets
By Wang Siyu (Jiuzhu), Technical Expert at Alibaba
For application O&M engineers, deploying and delivering stateful applications is no easy task. Common stateful operations include configuring disk persistence, assigning independent and stable network identifiers to all machines, and defining the publishing order. Kubernetes provides StatefulSets, a type of controller or workload used to deploy and run stateful applications in a Kubernetes environment.
1. Requirements of Stateful Applications
Kubernetes provides Deployments for managing application orchestration.
A Deployment provides the following functions:
- It allows you to define the desired number of pods. A Deployment controller ensures that the desired number of pods is maintained for the target version.
- It allows you to configure the pod publishing mode. A Deployment controller upgrades pods based on the predefined policy and keeps the number of unavailable pods within the specified limit during the upgrade process.
- It allows you to roll back the publishing process with one click when a problem occurs.
Simply put, a Deployment manages pods of the same version as identical replicas. Deployment controllers process pods of the same version in the same way, no matter what applications are deployed or what pod behaviors are configured.
This capability can meet the requirements of stateless applications, but what if you run stateful applications?
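Stateful applications typically bring requirements like those named in the introduction: persistent disk storage for each instance, an independent and stable network identifier for each instance, and a defined publishing order.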
The preceding requirements cannot be met through Deployments. Kubernetes provides StatefulSets for managing stateful applications.
StatefulSet: a Controller for Stateful Application Management
Many stateless applications in the community are also managed through StatefulSets. This article will explore why.
As shown in the right part of the preceding figure, each pod in a StatefulSet has an ordinal, which ranges from 0 to the defined number of replicas minus 1. Each pod has an independent network identifier (that is, a hostname), an independent Persistent Volume Claim (PVC), and one or more Persistent Volumes (PVs). In other words, every pod in the same StatefulSet has a unique hostname and an exclusive storage disk, which meets the needs of many stateful applications.
As shown in the right part of the preceding figure:
- Each pod has an ordinal, which determines the specific order in which the pods are created, deleted, and upgraded.
- You can configure a headless service to assign a unique hostname to each pod.
- You can configure a PVC template to assign one or more PVs to each pod.
- You can publish a certain number of pods through phased release. For example, a StatefulSet has three replicas. You can upgrade one, two, or all of the replicas to the latest version. That is, you can specify the number of replicas for the grayscale upgrade.
2. Case Study
Create a StatefulSet
As shown in the left part of the preceding figure, a headless service is configured to assign an independent hostname to each pod in a StatefulSet. The service name is nginx.
In the right part, a StatefulSet is configured, where serviceName is set to nginx under spec. serviceName indicates the service that matches the StatefulSet.
spec contains other fields, such as selector and template. The selector field indicates a label selector. The label selection logic defined by selector must match app: nginx in template.metadata.labels. The template field defines an NGINX container of the Alpine image version, with Port 80 exposed as a web service.
template.spec defines volumeMounts. The mounted volume comes from volumeClaimTemplates, which is the PVC template. The PVC template defines a PVC called www-storage. volumeMounts.name is set to this PVC name, and the volume is mounted to the /usr/share/nginx/html directory. In this way, each pod has an independent PVC that is mounted to the corresponding directory in the container.
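The two manifests described above can be sketched as Python dictionaries, together with a small check of the rules the text mentions (the selector must match the template labels, serviceName must name the headless service, and each volume mount must refer to a declared volume). This is a hedged reconstruction, not the original orchestration file: the image tag nginx:alpine, the port name web, the access mode, and the 1Gi storage size are assumptions, and check_statefulset is a hypothetical helper.

```python
# Hedged reconstruction of the manifests from the figure, as Python dicts.
headless_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "nginx", "labels": {"app": "nginx"}},
    "spec": {
        "clusterIP": "None",  # clusterIP "None" makes the service headless
        "selector": {"app": "nginx"},
        "ports": [{"name": "web", "port": 80}],  # port name is an assumption
    },
}

statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "nginx-web"},
    "spec": {
        "serviceName": "nginx",
        "replicas": 3,
        "selector": {"matchLabels": {"app": "nginx"}},
        "template": {
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {
                "containers": [{
                    "name": "nginx",
                    "image": "nginx:alpine",  # assumed tag for the Alpine image
                    "ports": [{"containerPort": 80, "name": "web"}],
                    "volumeMounts": [{
                        "name": "www-storage",
                        "mountPath": "/usr/share/nginx/html",
                    }],
                }],
            },
        },
        "volumeClaimTemplates": [{
            "metadata": {"name": "www-storage"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],                # assumption
                "resources": {"requests": {"storage": "1Gi"}},  # assumption
            },
        }],
    },
}


def check_statefulset(sts, svc):
    """Return a list of consistency problems; an empty list means consistent."""
    problems = []
    spec = sts["spec"]
    labels = spec["template"]["metadata"]["labels"]
    for key, value in spec["selector"]["matchLabels"].items():
        if labels.get(key) != value:
            problems.append(f"selector {key}={value} does not match template labels")
    if spec["serviceName"] != svc["metadata"]["name"]:
        problems.append("serviceName does not name the given service")
    if svc["spec"].get("clusterIP") != "None":
        problems.append("service is not headless")
    pod_spec = spec["template"]["spec"]
    volume_names = {t["metadata"]["name"] for t in spec.get("volumeClaimTemplates", [])}
    volume_names |= {v["name"] for v in pod_spec.get("volumes", [])}
    for container in pod_spec["containers"]:
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in volume_names:
                problems.append(f"volumeMount {mount['name']} has no matching volume")
    return problems


print(check_statefulset(statefulset, headless_service))
```

Serialized to YAML or JSON, dictionaries of this shape are what a kubectl apply would submit to the cluster.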
Statuses of Services and StatefulSets
After you create a headless service and a StatefulSet, run the get command to verify that the NGINX service has been created.
The kubectl get endpoints nginx command output shows that the NGINX service registers three IP addresses and a port. The IP addresses map to the pod IP addresses, and the port maps to Port 80 configured under spec.
The kubectl get sts nginx-web command output includes the READY column with the value 3/3. The first number 3 indicates the number of pods currently in the Ready state in the StatefulSet, and the second number 3 indicates the desired number of pods in the StatefulSet.
Statuses of Pods and PVCs
The kubectl get pod command output shows that all three pods are in the Running state and are ready. The pod IP addresses map to the IP addresses in the kubectl get endpoints nginx command output.
In the kubectl get pvc command output, the NAME column includes names that consist of three elements: www-storage, nginx-web, and a number as the suffix. www-storage is the name defined in volumeClaimTemplates. nginx-web is the name of the StatefulSet. The number is the ordinal of the pod. Each PVC belongs to one of the three pods, and each PVC is bound to its own PV, so different pods use different PVs.
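The naming rule described above can be written down as a minimal sketch. The helper names pod_name and pvc_name are hypothetical; the naming pattern itself is the one visible in the command output.

```python
def pod_name(statefulset: str, ordinal: int) -> str:
    """Pod name: <StatefulSet name>-<ordinal>."""
    return f"{statefulset}-{ordinal}"


def pvc_name(claim_template: str, statefulset: str, ordinal: int) -> str:
    """PVC name: <volumeClaimTemplates name>-<StatefulSet name>-<ordinal>."""
    return f"{claim_template}-{statefulset}-{ordinal}"


for ordinal in range(3):
    print(pod_name("nginx-web", ordinal), "->",
          pvc_name("www-storage", "nginx-web", ordinal))
```

Because both names depend only on the StatefulSet name and the ordinal, a pod that is re-created with the same ordinal resolves to the same PVC.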
Deployments use ReplicaSets to manage pod versions and keep the desired number of pods, whereas a StatefulSet controller manages pods directly. A StatefulSet identifies the version of each managed pod through the pod label controller-revision-hash. The label is similar to the pod-template-hash label that is injected into a pod by a Deployment or ReplicaSet.
As shown in the preceding figure, the kubectl get pod command output includes controller-revision-hash, where the hash indicates the template version of the pod when it was created for the first time. It ends with 677759c9b8. Note down the value of controller-revision-hash to check whether it changes when pods are upgraded later.
As shown in the preceding figure, the command applies the StatefulSet configuration in which the image version of the StatefulSet is upgraded to a new version.
View the Status of the New Version
Run the kubectl get pod command to query the revision hash. The command output shows that the controller-revision-hash values of the three pods are upgraded to the new revision hash and end with 7c55499668. By viewing the pod creation time, you can see that Pod 2 is created first, followed by Pod 1 and Pod 0. The pod creation times therefore reveal the pod upgrade order: Pod 2 -> Pod 1 -> Pod 0. The PVCs used by the pods remain unchanged after the upgrade. The data stored in PVs before the upgrade is mounted to the upgraded pods.
As shown in the right part of the preceding figure, there are several key fields in the status part of the StatefulSet:
- currentReplicas: the number of replicas in the current version.
- currentRevision: the current version.
- updatedReplicas: the number of replicas in the new version.
- updateRevision: the version to upgrade to.
All pods are upgraded to the target version if currentReplicas is the same as updatedReplicas and currentRevision is the same as updateRevision.
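The completion rule above can be expressed as a small check over the status block. This is a minimal sketch: rollout_complete is a hypothetical helper, the two sample status dictionaries are illustrative values rather than real command output, and the replica counter is read under its API spelling, updatedReplicas.

```python
def rollout_complete(status: dict) -> bool:
    """True when every pod of the StatefulSet is on the target revision."""
    return (
        status.get("currentRevision") == status.get("updateRevision")
        and status.get("currentReplicas") == status.get("updatedReplicas")
    )


# Illustrative status snapshots (not real output); the revision names reuse
# the hash suffixes quoted in the text.
mid_rollout = {
    "currentRevision": "nginx-web-677759c9b8",
    "updateRevision": "nginx-web-7c55499668",
    "currentReplicas": 2,
    "updatedReplicas": 1,
}
finished = {
    "currentRevision": "nginx-web-7c55499668",
    "updateRevision": "nginx-web-7c55499668",
    "currentReplicas": 3,
    "updatedReplicas": 3,
}

print(rollout_complete(mid_rollout), rollout_complete(finished))
```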
StatefulSet Orchestration File
Assume that you have connected to an Alibaba Cloud cluster, which has three nodes.
The following shows how to create a StatefulSet and a service. Let’s look at the orchestration file.
As shown in the preceding figure, in the service configuration, Port 80 is exposed for NGINX. In the StatefulSet configuration, metadata.name is set to nginx-web, image information is defined in template.spec.containers, and volumeClaimTemplates defines a PVC template.
Start to Create
Run the command shown in the preceding figure to create a service and a StatefulSet. Run the kubectl get pod command to verify that Pod 0 was created first. Run the kubectl get pvc command to verify that PVC 0 is bound to a PV.
The preceding command output shows that Pod 0 is being created and is in the ContainerCreating state.
After Pod 0 is created, Pod 1 and Pod 2 are created in sequence, and related PVCs are also created.
A PVC is created before each pod is created. After the PVC is created, it transitions from the Pending state to the Bound state, indicating that the PVC is bound to a PV. The pod then enters the ContainerCreating state and finally reaches the Running state.
View the Status
Run the kubectl get sts nginx-web -o yaml command to view the StatefulSet status.
As shown in the preceding command output, the desired number of replicas is 3, the number of available replicas is also 3, and the current version is the latest.
Port 80 is exposed for the NGINX service, which has three IP addresses.
Run the kubectl get pod command to verify that the IP addresses of the three pods map to the IP addresses under ENDPOINTS.
In short, the three PVCs and three pods reach the desired status. In the status data of the StatefulSet, the values of currentReplicas and readyReplicas are both 3.
The upgrade command takes the form kubectl set image statefulset nginx-web nginx=nginx:mainline. In this command, kubectl set image is the fixed part that declares an image change, statefulset indicates the resource type, and nginx-web indicates the resource name. In nginx=nginx:mainline, the nginx before the equal sign indicates the container name defined under template, and nginx:mainline indicates the target image version.
By running the preceding command, you can upgrade the image in the StatefulSet to the target version.
Run the kubectl get pod command to view the pod status. nginx-web-1 and nginx-web-2 are in the Running state. Their controller-revision-hash indicates the latest version. The original nginx-web-0 pod has been deleted, and the new pod is being created.
Check the pod status again, and you can see that all pods are in the Running state.
View the StatefulSet information. In the status part, currentRevision shows the latest version, indicating that the three pods in the StatefulSet are of the latest version.
How do we determine whether the three pods still use their original hostnames and PVCs after the upgrade?
The hostnames configured for the headless service are associated only with pod names and, therefore, can be reused by the upgraded pods as long as the pod names remain unchanged after the upgrade.
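The following minimal sketch shows why the network identity survives an upgrade: the DNS name is derived only from the pod name and the service name. pod_dns_name is a hypothetical helper, and the namespace default and the cluster domain cluster.local are assumptions that may differ in a given cluster.

```python
def pod_dns_name(statefulset: str, ordinal: int, service: str,
                 namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Stable DNS name of a StatefulSet pod behind a headless service.

    The namespace and cluster_domain defaults are assumptions.
    """
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"


# The name depends only on the ordinal, not on the image version or pod IP.
for ordinal in range(3):
    print(pod_dns_name("nginx-web", ordinal, "nginx"))
```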
The preceding command output shows that the PVC creation time is still the same as the time when the pod was created for the first time, indicating that the upgraded pods still use their original PVCs.
For example, by viewing the details of a pod, you can see that, in the pod’s volumes declaration, the name www-storage-nginx-web-0 under persistentVolumeClaim matches the PVC with the ordinal 0, which is the PVC used by the old pod. When a pod is upgraded, the StatefulSet controller deletes the old pod and creates a pod with the same name as the old pod, so the new pod reuses the PVC of the old pod.
This enables the reuse of network storage after pod upgrades.
4. Architecture Design
A StatefulSet supports the creation of three types of resources:
- ControllerRevision: allows the StatefulSet to manage different template versions. For example, a ControllerRevision is created for the first template version of the NGINX service when this service is created. After the image version is modified, the StatefulSet controller creates another ControllerRevision. In other words, each ControllerRevision maps to one template version and to that version's ControllerRevision hash. The name of a ControllerRevision is the value recorded in the pod label controller-revision-hash.
- PVC: If you define volumeClaimTemplates (the PVC template) under spec, the StatefulSet creates a PVC based on this template before creating a pod and adds the PVC to the pod volume. If you do not define a PVC template under spec, no independent PV is mounted to the created pod.
- Pod: A StatefulSet creates, deletes, and upgrades pods in order. Each pod has a unique ordinal.
As shown in the preceding figure, the StatefulSet controller owns three types of resources: ControllerRevision, pod, and PVC.
In the current version, the StatefulSet adds OwnerReferences only to ControllerRevisions and pods, but not to PVCs. When an owner resource is deleted, the resources whose OwnerReferences point to it are also deleted by default. Therefore, after a StatefulSet is deleted, the ControllerRevisions and pods created by the StatefulSet are also deleted. However, the created PVCs are not deleted because they do not have OwnerReferences.
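The effect of OwnerReferences on deletion can be illustrated with a toy model. This is a minimal sketch that follows only one level of ownership; the Resource class and cascade_delete function are hypothetical and do not represent the Kubernetes garbage collector.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Resource:
    kind: str
    name: str
    owner: Optional[Tuple[str, str]] = None  # (kind, name) of the owner, if any


def cascade_delete(resources: List[Resource], owner: Tuple[str, str]):
    """Split resources into (deleted, kept) when `owner` is deleted."""
    deleted = [r for r in resources if r.owner == owner]
    kept = [r for r in resources if r.owner != owner]
    return deleted, kept


sts = ("StatefulSet", "nginx-web")
resources = (
    [Resource("ControllerRevision", f"nginx-web-{h}", sts)
     for h in ("677759c9b8", "7c55499668")]
    + [Resource("Pod", f"nginx-web-{i}", sts) for i in range(3)]
    # PVCs carry no owner reference, so they survive the deletion.
    + [Resource("PersistentVolumeClaim", f"www-storage-nginx-web-{i}")
       for i in range(3)]
)

deleted, kept = cascade_delete(resources, sts)
print("deleted:", [r.name for r in deleted])
print("kept:", [r.name for r in kept])
```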
The preceding figure shows the workflow of a StatefulSet controller.
The StatefulSet controller first registers the event handlers of the informer to process changes of the StatefulSet and its pods. In the controller logic, upon receiving a change of the StatefulSet or a pod, the StatefulSet controller queues up the StatefulSet. Then, the StatefulSet controller dequeues the StatefulSet and performs the Update Revision operation. That is, the StatefulSet controller checks whether the template of the StatefulSet has a ControllerRevision. If no ControllerRevision is available, the template has been updated. In this case, the StatefulSet controller creates a revision, resulting in a new version of ControllerRevision hash.
Then, the StatefulSet controller fetches all the pods it manages and sorts them by ordinal. If any pods are missing, they are created in the order of their ordinals. If any pods are redundant, they are deleted in the order of their ordinals. When the number of pods and the pod ordinals are consistent with the number and ordinals of replicas, the StatefulSet controller checks whether to upgrade the pods. In the Manage pods in order process, the StatefulSet controller checks whether pods are sorted by ordinal. In the Update in order process, the StatefulSet controller checks whether pods are of the desired version. If not, the StatefulSet controller upgrades the pods in the order of their ordinals.
The Update in order process is essentially a process of deleting pods. After a pod is deleted, the StatefulSet controller finds, the next time it processes the StatefulSet, that this pod is missing. Then, the StatefulSet controller creates the pod again during the Manage pods in order process. After that, the StatefulSet controller updates the status. The resulting status can be displayed by running the command described in the preceding sections.
Through this workflow, the StatefulSet controller can manage stateful applications.
Assume that the initial configuration of a StatefulSet is as follows: The number of replicas is 1 and one pod, Pod 0, is managed. If you change the number of replicas from 1 to 3, Pod 1 is created first, and Pod 2 is created when Pod 1 is in the Ready state.
As shown in the preceding figure, pods in the StatefulSet are created and numbered from 0. The ordinals of the pods in a StatefulSet with N replicas fall in the range [0, N). When N is greater than 0, the ordinals of pods range from 0 to N - 1.
Scaling Management Policies
If you do not want to create or delete pods in the order of their ordinals, StatefulSets allow you to do so through other logic. This is why some people in the community also use StatefulSets to manage stateless applications. A StatefulSet assigns a unique network identifier and independent network storage to each of its managed pods and supports scaling pods concurrently.
StatefulSet.spec includes the podManagementPolicy field, which can be set to OrderedReady (default value) or Parallel.
If podManagementPolicy is unspecified, the StatefulSet controller uses OrderedReady by default, and pods are scaled in order. This means that, during scale-up, a pod is created only after all preceding pods are in the Ready state. When pods are scaled down, they are deleted in descending order of their ordinals.
For example, in the preceding figure, when the StatefulSet is scaled from Pod 0 to Pod 0, Pod 1, and Pod 2, Pod 1 is created first and then Pod 2 is created only when Pod 1 is in the Ready state. If Pod 0 changes to the Not Ready state due to a host or application error while Pod 1 is being created, then the StatefulSet controller will not create Pod 2. This means that a pod is created only when all the preceding pods are in the Ready state. In this example, the StatefulSet can create Pod 2 only after both Pod 0 and Pod 1 are in the Ready state.
If podManagementPolicy is set to Parallel, pods are scaled in parallel, without having to wait until all preceding pods are ready or pods with greater ordinals are deleted.
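The difference between the two policies can be sketched as a function that decides what a single pass of the controller would do. This is a simplified model under stated assumptions, not the controller's implementation: scale_actions is a hypothetical function, and it ignores additional readiness and termination checks that the real controller applies.

```python
def scale_actions(pods, replicas, policy="OrderedReady"):
    """Return the (action, ordinal) pairs for one simplified controller pass.

    pods maps each existing pod ordinal to True if that pod is Ready.
    """
    missing = [i for i in range(replicas) if i not in pods]
    excess = sorted((i for i in pods if i >= replicas), reverse=True)

    if policy == "Parallel":
        # Create and delete everything at once, without waiting.
        return [("create", i) for i in missing] + [("delete", i) for i in excess]
    if policy != "OrderedReady":
        raise ValueError(f"unknown podManagementPolicy: {policy}")

    if excess:
        # Scale down one pod at a time, greatest ordinal first.
        return [("delete", excess[0])]
    if missing and all(pods.get(i, False) for i in range(missing[0])):
        # Scale up one pod at a time, only when all predecessors are Ready.
        return [("create", missing[0])]
    return []


print(scale_actions({0: True}, 3))                     # Pod 1 is created first
print(scale_actions({0: False, 1: False}, 3))          # Pod 0 not Ready: wait
print(scale_actions({0: True}, 3, policy="Parallel"))  # Pod 1 and Pod 2 together
```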
Assume that StatefulSet Template 1 maps to Revision 1 in the logic. The three pods in the StatefulSet are of the Revision 1 version. After you modify the template, such as modifying the image, the StatefulSet controller first creates Revision 2, which maps to the ControllerRevision 2 resource, whose name is used as the new revision hash. The controller then upgrades pods one by one in descending order of their ordinals: after Pod 2 is upgraded to the new version, it deletes and re-creates Pod 1, and then Pod 0.
The logic here is simple. In the upgrade process, the StatefulSet controller deletes the old-version pod with the greatest ordinal. During the next reconcile cycle, the StatefulSet controller finds that this pod is missing and then creates a pod of the new version.
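The delete-then-re-create cycle can be traced with a small simulation. This is a minimal sketch under simplifying assumptions: reconcile is a hypothetical function, pods become ready instantly, and the revision names revision-1 and revision-2 are placeholders.

```python
def reconcile(pods, replicas, update_revision):
    """Run one simplified reconcile pass and describe the action taken.

    pods maps ordinal -> revision and is modified in place.
    Returns None when nothing is left to do.
    """
    # First, re-create any missing pod, lowest ordinal first.
    for ordinal in range(replicas):
        if ordinal not in pods:
            pods[ordinal] = update_revision
            return f"create pod {ordinal} at {update_revision}"
    # Then, delete the outdated pod with the greatest ordinal.
    for ordinal in range(replicas - 1, -1, -1):
        if pods[ordinal] != update_revision:
            del pods[ordinal]
            return f"delete pod {ordinal}"
    return None


pods = {0: "revision-1", 1: "revision-1", 2: "revision-1"}
log = []
while True:
    action = reconcile(pods, 3, "revision-2")
    if action is None:
        break
    log.append(action)

for line in log:
    print(line)
```

Each upgrade takes two passes: one that deletes the old pod and one that notices the gap and fills it with a pod of the new revision.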
Let’s take a look at the following spec fields:
- replicas: the desired number of replicas.
- selector: the label selector, which must be consistent with spec.template.metadata.labels.
- template: the pod template, which defines the basic information about the pod to be created.
- volumeClaimTemplates: a list of PVC templates. If this field is set, PVCs are created before pods. After a PVC is created, it is injected as a volume, referenced by the PVC name, into the pod that is created based on the pod template.
- serviceName: the name of a headless service. If you do not want to configure a headless service, you can set this field to a nonexistent value, which is not checked by the StatefulSet controller. We recommend that you configure a headless service regardless of whether or not the pods in a StatefulSet require hostnames.
- podManagementPolicy: the policy that defines how pods are managed. Valid values: OrderedReady and Parallel. Default value: OrderedReady.
- updateStrategy: the policy that defines how pods are upgraded. This field is a structure and will be described in detail later.
- revisionHistoryLimit: the maximum number of historical ControllerRevisions that can be kept. Default value: 10. Only ControllerRevisions not associated with any pods can be deleted.
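The revisionHistoryLimit rule in the list above can be sketched as follows. This is a hedged model, not the controller's code: revisions_to_prune is a hypothetical function, and it assumes that the limit counts only ControllerRevisions not referenced by any pod and that the oldest ones are removed first.

```python
def revisions_to_prune(revisions, in_use, limit=10):
    """Return the ControllerRevisions that may be deleted.

    revisions: revision names ordered from oldest to newest.
    in_use: names still referenced by at least one pod (never deleted).
    limit: value of revisionHistoryLimit.
    """
    unused = [r for r in revisions if r not in in_use]
    excess = len(unused) - limit
    return unused[:excess] if excess > 0 else []


history = ["rev-1", "rev-2", "rev-3", "rev-4", "rev-5"]  # placeholder names
print(revisions_to_prune(history, in_use={"rev-5"}, limit=2))
print(revisions_to_prune(history, in_use={"rev-1", "rev-5"}, limit=2))
```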
As shown in the right part of the preceding figure, the type field of updateStrategy (of type StatefulSetUpdateStrategyType) can be set to RollingUpdate or OnDelete.
- RollingUpdate: Pods are upgraded in rolling upgrade mode, which is similar to the upgrade method used by Deployments.
- OnDelete: Pods are upgraded only when deleted. The StatefulSet controller does not actively upgrade existing pods. For example, assume that a StatefulSet has three pods of the old version and that updateStrategy is set to OnDelete. When the image in spec is updated, the StatefulSet controller does not upgrade the pods to the new version one by one. Instead, a pod moves to the new version only after it is deleted and re-created. For example, when replicas are scaled down, the StatefulSet controller deletes pods, and when replicas are scaled up later, it creates pods of the new version.
RollingUpdateStatefulSetStrategy has the Partition field, which indicates the number of pods that keep the old version during the rolling upgrade. Note that this is not the number of pods that are upgraded to the new version during the grayscale upgrade.
For example, assume that a StatefulSet has 10 replicas and that the Partition field is set to 8. In this case, eight pods keep the old version, and the other two pods are upgraded to the new version during the grayscale upgrade. With 10 replicas, the pod ordinals fall in the range [0, 10). When Partition is set to 8, the eight pods in the ordinal range [0, 8) keep the old version, and the two pods in the ordinal range [8, 10) are upgraded to the new version.
Assume replicas = N and Partition = M (M < N). The pods that keep the old version fall in the ordinal range [0, M), and the pods upgraded to the new version fall in the ordinal range [M, N). The Partition field can be used for the grayscale upgrade, which is currently not supported by Deployments.
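The ranges [0, M) and [M, N) can be computed directly. The following minimal sketch uses a hypothetical helper, partition_split, and lists the upgraded ordinals in descending order, matching the upgrade order described earlier.

```python
def partition_split(replicas: int, partition: int):
    """Return (keep_old, upgrade) ordinals for a rolling update with Partition.

    keep_old covers [0, partition); upgrade covers [partition, replicas),
    listed in descending order because pods are upgraded from the greatest
    ordinal down.
    """
    keep_old = [i for i in range(replicas) if i < partition]
    upgrade = [i for i in range(replicas - 1, -1, -1) if i >= partition]
    return keep_old, upgrade


keep_old, upgrade = partition_split(replicas=10, partition=8)
print("keep old version:", keep_old)
print("upgrade in order:", upgrade)
```

Lowering Partition step by step, for example from 8 to 5 and then to 0, widens the upgraded range until all pods run the new version.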
Let’s summarize what we have learned in this article:
- StatefulSets are a common type of workload in Kubernetes intended for stateful application deployment, but they can also be used to deploy stateless applications.
- StatefulSets directly operate on pods to scale and publish them, which is different from Deployments that use ReplicaSets and other workloads for pod management.
- A StatefulSet assigns an exclusive PVC and a unique hostname to each of its managed pods, which can reuse the original PVC and hostname after upgrades.