By Sun Zhiheng (Huizhi), Development Engineer at Alibaba Cloud
Kubernetes Persistent Storage: Basic Concepts
Before explaining the Kubernetes storage process, let’s review the basic concepts of persistent storage in Kubernetes.
- In-tree: The code logic is in the Kubernetes repository.
- Out-of-tree: The code logic is outside the Kubernetes repository and decoupled from the Kubernetes code.
- PersistentVolume (PV): This is a cluster-level resource created by the cluster administrator or external provisioner. The lifecycle of a PV is independent of the pods that use the PV. The .Spec of the PV stores details about storage devices.
- PersistentVolumeClaim (PVC): This is a namespace-level resource that is created by a user or StatefulSet controller based on VolumeClaimTemplate. A PVC is similar to a pod. Pods consume node resources and PVCs consume PV resources. Pods may request specific levels of resources (CPU and memory). PVCs may request the size and access mode of a specific volume.
- StorageClass: This is a cluster-level resource created by the cluster administrator. A StorageClass provides administrators with a class template used to dynamically provision volumes. The .Spec of the StorageClass defines the different quality of service (QoS) and backup policies of PVs.
- CSI: The Container Storage Interface (CSI) is an industry-standard interface that allows storage providers (SPs) to work across different container orchestration (CO) systems through CSI-based plug-ins. CO systems include Kubernetes, Mesos, and Swarm.
- PV Controller binds PVs and PVCs and manages their lifecycles. It also performs the Provision and Delete operations on data volumes as needed.
- AD Controller performs the Attach and Detach operations on data volumes, and attaches devices to target nodes.
- Kubelet is the main node agent running on each node. It manages pod lifecycles, checks container health, and monitors containers.
- Volume Manager is a component of the kubelet. It performs the Mount, Unmount, Attach, and Detach operations on data volumes. These operations require specific parameter settings of the kubelet. It also formats volume devices.
- Volume Plugins are storage plug-ins developed by storage vendors. They extend the volume management capabilities for various storage types and implement the operations on third-party storage, that is, the operations described above (Provision, Delete, Attach, Detach, Mount, and Unmount). Volume plug-ins are either in-tree or out-of-tree.
- External Provisioner is a sidecar container that calls the CreateVolume and DeleteVolume functions of Volume Plugins to perform the Provision and Delete operations. The Kubernetes PV Controller cannot directly call the functions of Volume Plugins. These functions are called by External Provisioner through gRPC.
- External Attacher is a sidecar container that calls the ControllerPublishVolume and ControllerUnpublishVolume functions of Volume Plugins to perform the Attach and Detach operations. The Kubernetes AD Controller cannot directly call the functions of Volume Plugins. These functions are called by External Attacher through gRPC.
3) How to Use PVs
Kubernetes introduces PVs and PVCs so that applications and their developers can request storage resources without having to deal with the details of storage devices. A PV can be created in one of the following ways:
- A cluster administrator manually and statically creates the PV required by an application.
- A user manually creates a PVC and the Provisioner component dynamically creates the corresponding PV.
Let’s use the shared storage of a network file system (NFS) as an example to explain the differences between the two PV creation methods.
Statically Create a PV
The following figure shows the process of statically creating a PV.
Step 1) A cluster administrator creates an NFS PV. NFS is a type of in-tree storage natively supported by Kubernetes. The YAML file is as follows:
Step 2) A user creates a PVC. The YAML file is as follows:
Run the kubectl get pvc command to check that the PV and PVC are bound.
[root@huizhi ~]# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nfs-pvc Bound nfs-pv-no-affinity 10Gi RWO 4s
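As a rough illustration of Steps 1 and 2, the sketch below builds a PV and a PVC manifest whose names, size, and access mode match the kubectl output above. The NFS server address, the export path, and the Retain reclaim policy are placeholders assumed for this example, not values from the original environment. The manifests are emitted as JSON, which kubectl also accepts.

```python
import json

# Placeholders (assumptions): replace with the real NFS endpoint.
NFS_SERVER = "192.0.2.10"
NFS_PATH = "/exports/data"

# Step 1: PV created by the cluster administrator.
pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": "nfs-pv-no-affinity"},
    "spec": {
        "capacity": {"storage": "10Gi"},
        "accessModes": ["ReadWriteOnce"],
        "persistentVolumeReclaimPolicy": "Retain",  # assumption
        "nfs": {"server": NFS_SERVER, "path": NFS_PATH},
    },
}

# Step 2: PVC created by the user. An empty storageClassName
# keeps the claim out of dynamic provisioning.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "nfs-pvc"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "",
        "resources": {"requests": {"storage": "10Gi"}},
    },
}


def to_manifest(obj):
    """Serialize a manifest dict to JSON text."""
    return json.dumps(obj, indent=2)


if __name__ == "__main__":
    print(to_manifest(pv))
    print(to_manifest(pvc))
```

Saving each printed document to a file and passing it to kubectl apply -f would create the objects, assuming the NFS export actually exists.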
Step 3) The user creates an application and uses the PVC created in Step 2.
- image: nginx:alpine
- mountPath: /data
- name: nfs-volume
The NFS remote storage is mounted to the /data directory of the NGINX container in the pod.
Dynamically Create a PV
To dynamically create a PV, ensure that the cluster is deployed with an NFS client provisioner and the corresponding StorageClass.
Compared with static PV creation, dynamic PV creation requires no intervention from the cluster administrator. The following figure shows the process of dynamically creating a PV.
The cluster administrator only needs to ensure that the environment contains an NFS-related StorageClass.
Step 1) The user creates a PVC and sets storageClassName to the name of the NFS-related StorageClass.
Step 2) The NFS client provisioner in the cluster dynamically creates the corresponding PV. A PV is created in the environment and bound to the PVC.
[root@huizhi ~]# kubectl get pv
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM REASON AGE
pvc-dce84888-7a9d-11e6-b1ee-5254001e0c1b 10Mi RWX Delete Bound default/nfs 4s
Step 3) The user creates an application and uses the PVC created in Step 1. This step is the same as Step 3 of statically creating a PV.
Process of Kubernetes Persistent Storage
The following figure shows the process of Kubernetes persistent storage. This figure is taken from the cloud native storage courses given by Junbao.
Let’s take a look at the steps involved in the process.
1) A user creates a pod that contains a PVC, which uses a dynamic PV.
2) Scheduler schedules the pod to an appropriate worker node based on the pod configuration, node status, and PV configuration.
3) PV Controller detects that the PVC used by the pod is in the Pending state and calls Volume Plugin (in-tree) to create the volume and the corresponding PV object. In the out-of-tree case, this step is performed by External Provisioner.
4) AD Controller detects that the pod and PVC are in the ‘To Be Attached’ state and calls Volume Plugin to attach the storage device to the target worker node.
5) On the worker node, Volume Manager in the kubelet waits until the storage device is attached and uses Volume Plugin to mount the device to the global directory `/var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PV name]` (iscsi is used as an example).
6) The kubelet uses Docker to start the containers in the pod and uses a bind mount to map the volume mounted to the local global directory into the containers.
The following diagram shows a detailed process:
2) Process Explanation
The persistent storage process varies slightly depending on different Kubernetes versions. This article uses Kubernetes 1.14.8 as an example.
The preceding process map shows the three stages from when a volume is created to when it is used by applications: Provision/Delete, Attach/Detach, and Mount/Unmount.
PV Controller Workers
- ClaimWorker processes the Add, Update, and Delete events of PVCs and the status changes of PVCs.
- VolumeWorker processes the status changes of PVs.
PV Status Changes (UpdatePVStatus)
- The PV starts in the Available state and changes to the Bound state after being bound to the PVC.
- The PV changes to the Released state after the bound PVC is deleted.
- The PV changes to the Available state when the PV reclaim policy is Recycle or the .Spec.ClaimRef of the PV is manually deleted.
- The PV changes to the Failed state when the PV reclaim policy is unknown, the recycle operation fails, or the volume cannot be deleted.
- The PV changes to the Available state when the .Spec.ClaimRef of the PV is manually deleted.
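The PV status changes listed above can be read as a small state machine. The following is a minimal sketch of that reading, not the PV Controller's actual code. The event names are hypothetical labels introduced for this example, and the sketch assumes that the recycle, failure, and ClaimRef-removal events apply to a PV in the Released state.

```python
from enum import Enum


class PVPhase(Enum):
    AVAILABLE = "Available"
    BOUND = "Bound"
    RELEASED = "Released"
    FAILED = "Failed"


# (current phase, event) -> next phase.
# Event names are hypothetical labels for the conditions in the list above.
PV_TRANSITIONS = {
    (PVPhase.AVAILABLE, "bound_to_pvc"): PVPhase.BOUND,
    (PVPhase.BOUND, "pvc_deleted"): PVPhase.RELEASED,
    (PVPhase.RELEASED, "recycled"): PVPhase.AVAILABLE,
    (PVPhase.RELEASED, "claim_ref_removed"): PVPhase.AVAILABLE,
    (PVPhase.RELEASED, "reclaim_failed"): PVPhase.FAILED,
}


def next_pv_phase(phase, event):
    """Return the next phase, or the current one if the event does not apply."""
    return PV_TRANSITIONS.get((phase, event), phase)


if __name__ == "__main__":
    phase = PVPhase.AVAILABLE
    for event in ("bound_to_pvc", "pvc_deleted", "recycled"):
        phase = next_pv_phase(phase, event)
        print(event, "->", phase.value)
```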
PVC Status Changes (UpdatePVCStatus)
- The PVC changes to the Pending state when the cluster does not include any PV that matches the PVC. The PVC changes from the Pending to the Bound state after it is bound to a PV.
- The PVC changes to the Lost state when the bound PV is deleted from the environment.
- A PVC in the Lost state changes back to the Bound state after it is bound to a PV with the same name as the previous PV.
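One way to picture the PVC status rules is as a function of the PV name recorded in the PVC and the set of PVs that currently exist. This is a simplified sketch under that assumption. The function name and parameters are hypothetical, and the real controller tracks more state than this.

```python
def pvc_phase(bound_pv_name, existing_pv_names):
    """Derive the PVC phase.

    bound_pv_name: name of the PV recorded in the PVC, or None if no
        matching PV has been found yet.
    existing_pv_names: names of the PVs currently present in the cluster.
    """
    if bound_pv_name is None:
        return "Pending"
    if bound_pv_name not in existing_pv_names:
        return "Lost"   # the bound PV was deleted from the environment
    return "Bound"


if __name__ == "__main__":
    print(pvc_phase(None, {"pv-a"}))     # no matching PV yet
    print(pvc_phase("pv-a", {"pv-a"}))   # bound
    print(pvc_phase("pv-a", set()))      # bound PV deleted
    print(pvc_phase("pv-a", {"pv-a"}))   # PV with the same name re-created
```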
Provisioning Process (Assuming a User Creates a PVC)
Static volume process (FindBestMatch): PV Controller selects a PV in the Available state in the environment to match to the new PVC.
- DelayBinding: PV Controller determines whether to delay PVC binding. First, PV Controller checks whether the PVC's annotations include volume.kubernetes.io/selected-node. If so, the scheduler has already selected a node for the PVC (the PVC goes through the ProvisionVolume process), and binding is not delayed. Second, if the PVC's annotations do not include volume.kubernetes.io/selected-node and no StorageClass exists, binding is not delayed. If a StorageClass exists, PV Controller checks its VolumeBindingMode field. If the field is set to WaitForFirstConsumer, binding is delayed. If it is set to Immediate, binding is not delayed.
- FindBestMatchPVForClaim: PV Controller tries to find a PV in the environment that matches the PVC. It traverses all PVs and selects the best one among the candidates. Filter rules: 1) PV Controller checks whether VolumeMode matches. 2) PV Controller checks whether the PV has already been bound to the PVC. 3) PV Controller checks whether the .Status.Phase of the PV is Available. 4) PV Controller uses the LabelSelector of the PVC to check whether the labels of the PV match. 5) PV Controller checks whether the PV and PVC have the same StorageClass. 6) In each iteration, PV Controller keeps track of the smallest PV that satisfies the size requested by the PVC and returns it as the final result.
- Bind: PV Controller binds the selected PV to the PVC. 1) The .Spec.ClaimRef of the PV is updated to the current PVC. 2) The .Status.Phase of the PV is updated to Bound. 3) The annotation pv.kubernetes.io/bound-by-controller: "yes" is added to the PV. 4) The .Spec.VolumeName of the PVC is updated to the name of the PV. 5) The .Status.Phase of the PVC is updated to Bound. 6) The annotation pv.kubernetes.io/bind-completed: "yes" is added to the PVC.
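The DelayBinding decision and the FindBestMatchPVForClaim filters can be sketched in a few lines. This is an illustrative model, not the controller's Go code. The PV and PVC classes and their fields are simplified stand-ins defined for this example, sizes are plain integers, and the check for a PV that is already pre-bound to the PVC (filter 2) is left out for brevity.

```python
from dataclasses import dataclass, field

SELECTED_NODE = "volume.kubernetes.io/selected-node"


@dataclass
class PV:
    name: str
    capacity: int                      # any consistent unit, e.g. GiB
    phase: str = "Available"
    volume_mode: str = "Filesystem"
    storage_class: str = ""
    labels: dict = field(default_factory=dict)


@dataclass
class PVC:
    request: int
    volume_mode: str = "Filesystem"
    storage_class: str = ""
    selector: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def should_delay_binding(pvc, binding_modes):
    """binding_modes maps a StorageClass name to its VolumeBindingMode."""
    if SELECTED_NODE in pvc.annotations:
        return False                   # the scheduler already picked a node
    mode = binding_modes.get(pvc.storage_class)
    if mode is None:
        return False                   # no StorageClass exists
    return mode == "WaitForFirstConsumer"


def find_best_match(pvc, pvs):
    """Return the smallest Available PV that satisfies the PVC, or None."""
    best = None
    for pv in pvs:
        if pv.volume_mode != pvc.volume_mode:
            continue
        if pv.phase != "Available":
            continue
        if any(pv.labels.get(k) != v for k, v in pvc.selector.items()):
            continue
        if pv.storage_class != pvc.storage_class:
            continue
        if pv.capacity < pvc.request:
            continue
        if best is None or pv.capacity < best.capacity:
            best = pv
    return best


if __name__ == "__main__":
    pvs = [PV("pv-20", 20), PV("pv-10", 10), PV("pv-5", 5)]
    match = find_best_match(PVC(request=8), pvs)
    print(match.name if match else None)
```

Choosing the smallest PV that still fits keeps larger volumes free for claims that need them.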
Dynamic volume process (ProvisionVolume): The dynamic provisioning process is initiated if no appropriate PV exists in the environment.
- Before Provisioning: 1) PV Controller determines whether the StorageClass used by the PVC is in-tree or out-of-tree by checking whether the Provisioner field of the StorageClass contains the kubernetes.io/ prefix. 2) PV Controller updates the PVC's annotation as follows: claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner
- In-tree Provisioning (Internal Provisioning): 1) The in-tree provisioner implements the NewProvisioner method of the ProvisionableVolumePlugin interface to return a new provisioner. 2) PV Controller calls the Provision function of the provisioner, which returns a PV object. 3) PV Controller creates the returned PV object and binds it to the PVC: .Spec.ClaimRef is set to the PVC, .Status.Phase is set to Bound, and .Spec.StorageClassName is set to the name of the PVC's StorageClass. The following annotations are added: "pv.kubernetes.io/bound-by-controller"="yes" and "pv.kubernetes.io/provisioned-by"=plugin.GetPluginName().
- Out-of-tree Provisioning (External Provisioning): 1) External Provisioner checks whether claim.Spec.VolumeName in the PVC is empty. If not, the PVC is skipped. 2) External Provisioner checks whether claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] in the PVC is the same as its own provisioner name. The provisioner name is set by the --provisioner parameter passed to External Provisioner at startup. 3) If VolumeMode of the PVC is set to Block, External Provisioner checks whether it supports block devices. 4) External Provisioner calls the Provision function, which calls the CreateVolume interface of the CSI storage plug-in through gRPC. External Provisioner then creates a PV to represent the volume and binds the PV to the PVC.
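The routing between in-tree and out-of-tree provisioning, and the checks External Provisioner runs before it provisions, can be summarized as two small predicates. This is a minimal sketch. The function names and the dict-based claim are hypothetical simplifications, and example.com/nfs is a made-up provisioner name.

```python
IN_TREE_PREFIX = "kubernetes.io/"
PROVISIONER_ANNOTATION = "volume.beta.kubernetes.io/storage-provisioner"


def is_in_tree(provisioner):
    """In-tree provisioner names carry the kubernetes.io/ prefix."""
    return provisioner.startswith(IN_TREE_PREFIX)


def external_should_provision(claim, my_name, supports_block):
    """Checks 1-3 an external provisioner runs before calling CreateVolume.

    claim: simplified PVC as a dict with optional keys
        'volumeName', 'annotations', and 'volumeMode'.
    my_name: the name this provisioner was started with.
    """
    if claim.get("volumeName"):
        return False                   # already points at a PV, skip
    annotations = claim.get("annotations", {})
    if annotations.get(PROVISIONER_ANNOTATION) != my_name:
        return False                   # meant for another provisioner
    if claim.get("volumeMode") == "Block" and not supports_block:
        return False                   # block devices not supported
    return True


if __name__ == "__main__":
    claim = {"annotations": {PROVISIONER_ANNOTATION: "example.com/nfs"}}
    print(is_in_tree("kubernetes.io/aws-ebs"))
    print(external_should_provision(claim, "example.com/nfs", False))
```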
The volume deletion process is the reverse of the provisioning process.
When a user deletes a PVC, PV Controller changes PV.Status.Phase to Released.
When PV.Status.Phase is set to Released, PV Controller checks the value of Spec.PersistentVolumeReclaimPolicy. If it is set to Retain, the PV is skipped. If it is set to Delete, one of the following processes is executed:
- In-tree Deleting: 1) The in-tree provisioner implements the NewDeleter method of the DeletableVolumePlugin interface to return a new deleter. 2) PV Controller calls the Delete function of the deleter to delete the corresponding volume. 3) PV Controller deletes the PV object after the volume is deleted.
- Out-of-tree Deleting: 1) External Provisioner calls the Delete function and calls the DeleteVolume interface of the CSI plug-in through gRPC. 2) External Provisioner deletes the PV object after the volume is deleted.
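The ordering in the deletion path matters: the backing volume is deleted first, and the PV object is removed only after that succeeds. The sketch below illustrates that ordering with plain dictionaries. The function, its parameters, and the delete_volume callback are hypothetical names used for illustration only.

```python
def reclaim(pv, pv_store, delete_volume):
    """Handle one PV according to its reclaim policy.

    pv: dict with 'name', 'phase', and 'reclaimPolicy'.
    pv_store: dict of PV objects keyed by name (stands in for the API server).
    delete_volume: callable that deletes the backing volume and raises on
        failure (stands in for the deleter or the CSI DeleteVolume call).
    Returns True if the PV object was removed.
    """
    if pv["phase"] != "Released":
        return False
    if pv["reclaimPolicy"] != "Delete":
        return False                   # for example Retain: leave it alone
    delete_volume(pv["name"])          # step 1: delete the backing volume
    del pv_store[pv["name"]]           # step 2: delete the PV object
    return True


if __name__ == "__main__":
    deleted = []
    store = {"pv-a": {"name": "pv-a", "phase": "Released", "reclaimPolicy": "Delete"}}
    print(reclaim(store["pv-a"], store, deleted.append), deleted, store)
```

If delete_volume raises, the PV object stays in the store, so the deletion can be attempted again later.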
Both the kubelet and AD Controller can perform the Attach and Detach operations. The kubelet performs them only when --enable-controller-attach-detach is set to false in the kubelet startup parameters. Otherwise (the default), AD Controller performs them. The following section explains the Attach and Detach operations using AD Controller as an example.
Two Core Variables of AD Controller
- DesiredStateOfWorld (DSW) indicates the expected volume attachment status in the cluster, including information about nodes->volumes->pods.
- ActualStateOfWorld (ASW) indicates the actual volume attachment status in the cluster, including information about volumes->nodes.
The Attaching Process
AD Controller initializes DSW and ASW based on the resource information in the cluster.
AD Controller has three components that periodically update DSW and ASW.
- The Reconciler component runs a GoRoutine periodically to ensure that the volume is attached or detached. During this period, ASW is continuously updated.
- In-tree Attaching: 1) The in-tree attacher implements the NewAttacher method of the AttachableVolumePlugin interface to return a new attacher. 2) AD Controller calls the Attach function of the attacher to attach the device. 3) ASW is updated.
- Out-of-tree Attaching: 1) The in-tree CSIAttacher is called to create a VolumeAttachment (VA) object, which contains the attacher information, node name, and information about the PV to be attached. 2) External Attacher watches VolumeAttachment resources in the cluster. If there are data volumes to be attached, External Attacher calls the Attach function, which calls the ControllerPublishVolume interface of the CSI plug-in through gRPC.
- The DesiredStateOfWorldPopulator component runs a GoRoutine periodically to update DSW.
- FindAndRemoveDeletedPods traverses all the pods in DSW. Any pods that have been deleted from the cluster are removed from DSW.
- FindAndAddActivePods traverses the pods in all PodListers. Any pods that do not exist in DSW are added to DSW.
- The PVC Worker component watches the Add and Update events of PVCs, processes PVC-related pods, and updates DSW in real-time.
- When a pod is deleted, AD Controller watches this event and checks whether the node where the pod is located carries the volumes.kubernetes.io/keep-terminated-pod-volumes annotation. If it does, no operations are performed. If not, the volume is removed from DSW.
- AD Controller uses Reconciler to bring ASW in line with DSW. The Detach operation is performed if ASW contains any volume that does not exist in DSW.
a) In-tree Detaching: 1) The in-tree attacher implements the NewDetacher method of the AttachableVolumePlugin interface to return a new detacher. 2) AD Controller calls the Detach function of the detacher to perform the Detach operation on the volume. 3) AD Controller updates ASW.
b) Out-of-tree Detaching: 1) AD Controller calls the in-tree CSIAttacher to delete the related VolumeAttachment (VA) object. 2) External Attacher watches the VolumeAttachment resources in the cluster. If a data volume needs to be detached, External Attacher calls the Detach function, which calls the ControllerUnpublishVolume interface of the CSI plug-in through gRPC. 3) AD Controller updates ASW.
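The interplay between DSW, ASW, the populator, and the reconciler described above is a reconciliation loop: compute the difference between the desired and actual sets, then act on it. The following sketch models that idea with plain dictionaries and sets. The function names and data shapes are hypothetical and far simpler than the controller's real caches.

```python
def populate_dsw(dsw, active_pods):
    """Sync DSW with the pods that currently exist in the cluster.

    dsw and active_pods both map pod name -> (node, volume).
    """
    for pod in list(dsw):
        if pod not in active_pods:
            del dsw[pod]                    # like FindAndRemoveDeletedPods
    for pod, placement in active_pods.items():
        dsw.setdefault(pod, placement)      # like FindAndAddActivePods


def reconcile(dsw, asw):
    """Move ASW toward DSW and report what had to be done.

    asw is a set of (node, volume) pairs that are currently attached.
    Returns (to_attach, to_detach) as sorted lists.
    """
    desired = set(dsw.values())
    to_detach = asw - desired               # attached but no longer wanted
    to_attach = desired - asw               # wanted but not yet attached
    asw -= to_detach                        # a real controller would call
    asw |= to_attach                        # Detach/Attach before updating
    return sorted(to_attach), sorted(to_detach)


if __name__ == "__main__":
    dsw = {"pod-old": ("node-1", "vol-a")}
    asw = {("node-1", "vol-a")}
    populate_dsw(dsw, {"pod-new": ("node-2", "vol-b")})
    print(reconcile(dsw, asw))
    print(asw)
```

Because the loop only looks at the difference between the two sets, running it again after a successful pass is a no-op.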
Volume Manager has two core variables:
- DesiredStateOfWorld (DSW) indicates the expected volume mount status in the cluster, including information about volumes->pods.
- ActualStateOfWorld (ASW) indicates the actual volume mount status in the cluster, including information about volumes->pods.
The mounting and unmounting processes are as follows:
The global directory (global mount path) exists because a block device can be mounted to the Linux system only once, whereas in Kubernetes a PV may be mounted to multiple pod instances on a node. A formatted block device is therefore mounted once to a temporary global directory on the node. Then, the global directory is mounted to the corresponding directory of each pod by using the bind mount technology of Linux. In the preceding process map, the global directory is
VolumeManager initializes DSW and ASW based on resource information in the cluster.
VolumeManager has two components that periodically update DSW and ASW.
- DesiredStateOfWorldPopulator periodically runs a GoRoutine to update DSW.
- Reconciler periodically runs a GoRoutine to ensure that a volume is mounted or unmounted. During this period, ASW is continuously updated.
UnmountVolumes ensures that the volumes are unmounted after the pod is deleted. All the pods in ASW are traversed. If any pod is not in DSW (indicating this pod has been deleted), the following operations are performed (VolumeMode=FileSystem is used as an example):
1) Remove all bind-mounts by calling the TearDown interface of Unmounter, or calling the NodeUnpublishVolume interface of the CSI plug-in in out-of-tree mode.
2) Unmount volume by calling the UnmountDevice function of DeviceUnmounter, or calling the NodeUnstageVolume interface of the CSI plug-in in out-of-tree mode.
3) ASW is updated.
MountAttachVolumes ensures that the volumes to be used by the pod are successfully mounted. All the pods in DSW are traversed. If any pod is not in ASW (the directory is to be mounted and mapped to the pod), the following operations are performed (VolumeMode=FileSystem is used as an example):
1) Wait until the volume is attached to the node by External Attacher or the kubelet.
2) Mount the volume to the global directory by calling the MountDevice function of DeviceMounter or calling the NodeStageVolume interface of the CSI plug-in in out-of-tree mode.
3) Update ASW if the volume is mounted to the global directory.
4) Mount the volume to the pod through bind-mount by calling the SetUp interface of Mounter or calling the NodePublishVolume interface of the CSI plug-in in out-of-tree mode.
5) Update ASW.
UnmountDetachDevices ensures that volumes are unmounted. All UnmountedVolumes in ASW are traversed. If any UnmountedVolumes do not exist in DSW (indicating these volumes are no longer used), the following operations are performed:
1) Unmount volume by calling the UnmountDevice function of DeviceUnmounter, or calling the NodeUnstageVolume interface of the CSI plug-in in out-of-tree mode.
2) ASW is updated.
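The reason for the two-level mount becomes clearer when the calls are traced for two pods that share one volume on a node. The sketch below records the CSI node calls in the out-of-tree case. It is a simplified model with hypothetical class and method names, and it separates the per-pod unmount from the device unmount, whereas the steps above describe them together.

```python
class NodeVolumeState:
    """Tracks global-directory mounts and per-pod bind mounts on one node."""

    def __init__(self):
        self.staged = set()        # volumes mounted to the global directory
        self.published = set()     # (volume, pod) bind mounts
        self.calls = []            # recorded CSI node calls, in order

    def mount(self, volume, pod):
        if volume not in self.staged:
            self.calls.append(("NodeStageVolume", volume))
            self.staged.add(volume)
        if (volume, pod) not in self.published:
            self.calls.append(("NodePublishVolume", volume, pod))
            self.published.add((volume, pod))

    def unmount(self, volume, pod):
        if (volume, pod) in self.published:
            self.calls.append(("NodeUnpublishVolume", volume, pod))
            self.published.discard((volume, pod))

    def unmount_unused_devices(self):
        in_use = {volume for volume, _ in self.published}
        for volume in sorted(self.staged - in_use):
            self.calls.append(("NodeUnstageVolume", volume))
            self.staged.discard(volume)


if __name__ == "__main__":
    node = NodeVolumeState()
    node.mount("pv-a", "pod-1")
    node.mount("pv-a", "pod-2")
    node.unmount("pv-a", "pod-1")
    node.unmount_unused_devices()      # pod-2 still uses pv-a: no unstage
    node.unmount("pv-a", "pod-2")
    node.unmount_unused_devices()
    for call in node.calls:
        print(call)
```

The volume is staged once and published once per pod, and it is unstaged only after the last pod has released it.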
This article introduces the basics and usage of Kubernetes persistent storage and analyzes the internal storage process of Kubernetes. In Kubernetes, all storage types require the preceding processes, but the Attach and Detach operations are not performed in certain scenarios. Any storage problem in an environment can be attributed to a fault in one of these processes.
Container storage is complex, especially in private cloud environments. However, through this process, it’s possible to seize more opportunities while braving more challenges. Currently, competition is fierce in the storage landscape of China’s private cloud market. Our agile PaaS container team is always looking for talented professionals to join us and help build a better future.