Cloud-Native Storage: Container Storage and Kubernetes Storage Volumes

By Kan Junbao (Junbao), Alibaba Cloud Senior Technical Expert

This series of articles on cloud-native storage explains the concepts, features, requirements, principles, usage, and cases of cloud-native storage. It aims to explore the new opportunities and challenges of cloud-native storage technology. This article is the second in the series and explains the concepts of container storage. If you are not familiar with this concept, I suggest that you read the first article of this series, “Cloud-Native Storage: The Cornerstone of Cloud-Native Applications.”

Docker storage volumes and Kubernetes storage volumes are essential for cloud-native storage.

  • A Docker storage volume is a form of storage organization for container services on a single node. It is based on data storage and container runtime technologies.
  • A Kubernetes storage volume is designed for storage orchestration in container clusters. It focuses on application-specific storage services.

Docker Storage

1. Container Read/Write Layer

Each layer of a container image is read-only so that image data can be shared by multiple containers. Yet when you start a container from an image, you can still read and write data inside the container. How is this done?

When a container uses an image, the container adds a read/write layer at the top of all image layers. Each running container mounts a read/write layer on top of all layers of the current image. All operations on the container are completed at this layer. When the container is released, the read/write layer is also released.

As shown in the preceding figure, three containers exist on the node. Container 1 and Container 2 run based on Image 1, and Container 3 runs based on Image 2.

The image storage layers are explained as follows:

  • The node contains six image layers from Layer 1 to Layer 6.
  • Image 1 consists of Layer 1, Layer 3, Layer 4, and Layer 5.
  • Image 2 consists of Layer 2, Layer 3, Layer 5, and Layer 6.

The two images share Layer 3 and Layer 5.

Container storage is explained as follows:

  • Container 1 is started by using Image 1.
  • Container 2 is started by using Image 1.
  • Container 3 is started by using Image 2.
  • Container 1 and Container 2 share Image 1. Each container has an independent writable layer. Container 3 shares Layer 3 and Layer 5 with Container 1 and Container 2.

Data sharing based on the layered structure of container images can significantly reduce the host storage usage by the container service.

In the container image structure with the read/write layer, data is read and written in the following way:

In the case of data reads, when the same file exists in multiple layers, the copy at the upper layer hides the copy at the lower layer.

Data is written at the uppermost read/write layer when you modify a file in a container. The technologies involved are copy-on-write (CoW) and allocate-on-demand.

(1) CoW: When a file that resides in a read-only image layer is modified, the file is first copied up to the read/write layer, and the modification is then made to that copy. The original file in the image layer remains unchanged.

(2) Allocate-on-demand: Disk space for the read/write layer is not reserved in advance; it is allocated only when new data is actually written.
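To make the layering tangible, the following is a minimal sketch that inspects the image layers and the container read/write layer on a node. It assumes the overlay2 storage driver; the image and container names are only examples.

# List the read-only layers that make up an image
$ docker image inspect nginx --format '{{json .RootFS.Layers}}'

# Start a container and inspect its storage layout:
# LowerDir points to the read-only image layers, UpperDir to the container read/write layer
$ docker run -d --name layer-demo nginx
$ docker inspect layer-demo --format '{{json .GraphDriver.Data}}'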

2. Storage Drivers

  • AUFS
  • OverlayFS
  • Devicemapper
  • Btrfs
  • ZFS

The following section explains how AUFS works.

AUFS is a type of union file system (UFS) and a file-level storage driver.

AUFS is a layered file system that can transparently superimpose one or more existing file systems into a single view. AUFS can mount directories that reside on different file systems under the same virtual file system.

You can superimpose and modify files layer by layer. Only the file system at the uppermost layer is writable, whereas the file systems at lower layers are read-only.

When you modify a file, AUFS creates a copy of this file and uses CoW to transfer this copy from the read-only layer to the writable layer for modification. The modified file is stored at the writable layer.

In Docker, the uppermost writable layer is the container's read/write layer, and all the lower layers are read-only image layers.
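The layering can also be reproduced by hand. The following sketch uses OverlayFS, which follows the same union-mount principle as AUFS and is more widely available on modern kernels; all directory names are only examples.

# Two read-only "image" layers plus one writable layer, merged into a single view
$ mkdir -p /tmp/union/{lower1,lower2,upper,work,merged}
$ echo "from lower1" > /tmp/union/lower1/a.txt
$ echo "from lower2" > /tmp/union/lower2/b.txt
$ sudo mount -t overlay overlay -o lowerdir=/tmp/union/lower1:/tmp/union/lower2,upperdir=/tmp/union/upper,workdir=/tmp/union/work /tmp/union/merged
$ ls /tmp/union/merged                     # a.txt and b.txt appear side by side
$ echo "changed" > /tmp/union/merged/a.txt
$ ls /tmp/union/upper                      # the CoW copy of a.txt lands in the writable layer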

3. Docker Data Volumes

Containers store data temporarily. The stored data is deleted when containers are released. After you mount external storage to a container file system by using a data volume, the application can reference external data or persistently store its generated data in the data volume. Therefore, container data volumes provide a method for data persistence.

Container storage consists of multiple read-only layers (image layers), a read/write layer, and external storage (data volume).

Container data volumes can be divided into single-node data volumes and cluster data volumes. A single-node data volume is a data volume that the container service mounts to a node. Docker volumes are typical single-node data volumes. Cluster data volumes provide cluster-level data volume orchestration capabilities. Kubernetes data volumes are typical cluster data volumes.

A Docker volume is a directory that can be used by multiple containers simultaneously. It is independent of the UFS and provides the following features:

  • Data volumes can be shared and reused among containers.
  • Storage drivers only support write operations at the writable layer, whereas data volumes support direct read and write operations on external storage, which is more efficient.
  • Data volumes are updated by reading and writing external storage. This does not affect images and the read/write layer of containers.
  • The lifecycle of a data volume is independent of containers. A data volume continues to exist after the containers that use it are released, until it is explicitly removed.

(1) Types of Docker Data Volumes

Docker data volumes come in three types: Bind, Volume, and Tmpfs.

Bind: You can mount a host directory or file to a directory inside the container.

  • Mount operations only use absolute paths on the host. Host directories that do not exist can be created automatically.
  • You can modify any files in container-mounted directories. This makes applications easier to use but also introduces security risks.

Volume: This mode uses data volumes managed by Docker and is also the mode through which third-party data volumes are accessed.

  • Volume command on the command-line interface (CLI): docker volume (create/rm)
  • The Volume mode is provided by Docker, so it cannot be used in non-Docker environments.
  • In Volume mode, data volumes are divided into named volumes and anonymous volumes. The only difference between them is that anonymous volumes are given randomly generated names (IDs).
  • Data volume drivers can be extended to support access by more types of external storage.

Tmpfs: This is a non-persistent volume type that stores data in memory. The data is lost when the container stops.
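A minimal Tmpfs example is shown below; the container names, mount path, and sizes are only illustrative.

$ docker run -d --name tmptest --tmpfs /cache:rw,size=64m nginx
$ docker run -d --name tmptest2 --mount type=tmpfs,destination=/cache,tmpfs-size=67108864 nginx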

(2) Syntax for Mounting in Bind Mode

The syntax is -v src:dst:opts or --mount type=bind,source=src,target=dst[,options].

  • “src” specifies the volume mapping source, that is, a host directory or file. It must be an absolute path.
  • “dst” specifies the destination path inside the container for the mount operation.
  • (Optional) “opts” specifies mount attributes. The options include ro, consistent, delegated, cached, z, and Z.
  • The options “consistent”, “delegated”, and “cached” configure the consistency of shared file mounts on macOS (Docker Desktop).
  • The options “Z” and “z” configure SELinux labels for host directories.

Example:

$ docker run -d --name devtest -v /home:/data:ro,rslave nginx
$ docker run -d --name devtest --mount type=bind,source=/home,target=/data,readonly,bind-propagation=rslave nginx
$ docker run -d --name devtest -v /home:/data:z nginx

(3) Syntax for Mounting in Volume Mode

The syntax is -v src:dst:opts or --mount source=src,target=dst[,options].

  • “src” specifies a volume mapping source. It can be set to the name of a data volume or left empty (an anonymous volume).
  • “dst” specifies the destination directory inside the container.
  • (Optional) “opts” specifies mount attributes. The option “ro” specifies read-only.

Example:

$ docker run -d --name devtest -v myvol:/app:ro nginx
$ docker run -d --name devtest --mount source=myvol2,target=/app,readonly nginx

4. Usage of Docker Data Volumes

(1) Volume Type

By default, the directory /var/lib/docker/volumes/{volume-id}/_data is created on the host for mapping purposes.

Named data volumes: docker run -d -v nas1:/data3 nginx

If the nas1 volume cannot be found, a volume of the default type (local) is created.
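To see where a named volume is stored on the host, you can inspect it as follows (the volume name is only an example):

$ docker volume create myvol
$ docker volume inspect myvol --format '{{.Mountpoint}}'   # typically /var/lib/docker/volumes/myvol/_data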

(2) Bind Mode

docker run -d -v /test:/data nginx

If the host does not contain the /test directory, this directory is created by default.

(3) Data Volume Containers

docker run -d --volumes-from nginx1 -v /test1:/data1 nginx

The preceding command starts a container that inherits all data volumes from an existing container (nginx1) and additionally defines a data volume of its own (/test1).

(4) Data Volume Mount Propagation

  • Private: Mounts are not propagated. The mounts in the source and destination directories are not propagated to each other.
  • Shared: Mounts are propagated between the source and destination directories.
  • Slave: Mounts of the source object can be propagated to the destination object, but not vice versa.
  • Rprivate: This implements Private recursion, which is the default mode.
  • Rshared: This implements Shared recursion.
  • Rslave: This implements Slave recursion.

Examples:

$ docker run -d -v /home:/data:shared nginx
The directories mounted to the /home directory of the host are available in the /data directory of the container, and vice versa.
$ docker run -d -v /home:/data:slave nginx
The directories mounted to the /home directory of the host are available in the /data directory of the container, but not vice versa.

(5) Visibility of Data Volume Mounts

Mount visibility in Volume mode: This is determined by the contents of both the volume and the image directory.

  • Empty local directories and empty image directories: No special operations are required.
  • Empty local directories and non-empty image directories: Copy the content of image directories to the host. The content is retained when the container is deleted.
  • Non-empty local directories and empty image directories: Map the content of local directories to the container.
  • Non-empty local directories and non-empty image directories: Map the content of local directories to the container. The content of container directories is hidden.

Mount visibility in Bind mode: This is determined by host directories.

  • Empty local directories and empty image directories: No special operations are required.
  • Empty local directories and non-empty image directories: The container directories become empty.
  • Non-empty local directories and empty image directories: Map the content of local directories to the container.
  • Non-empty local directories and non-empty image directories: Map the content of local directories to the container. The content of container directories is hidden.
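The following sketch demonstrates the two key behaviors above; the volume name, container names, and the /tmp/empty path are only examples.

# Volume mode: the non-empty image directory is copied into the empty named volume
$ docker volume create web-data
$ docker run -d --name web1 -v web-data:/usr/share/nginx/html nginx
$ docker run --rm -v web-data:/data alpine ls /data               # shows the files copied from the image

# Bind mode: an empty host directory hides the image content
$ mkdir -p /tmp/empty
$ docker run --rm -v /tmp/empty:/usr/share/nginx/html nginx ls /usr/share/nginx/html   # prints nothing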

5. Docker Data Volume Plug-ins

  • Multiple storage plug-ins can be deployed on a single node.
  • A storage plug-in manages the mounting service of a specific storage class.

The Docker daemon communicates with volume drivers in the following ways:

  • Sock file: stored in the /run/docker/plugins directory in Linux
  • Spec file: defined by /etc/docker/plugins/convoy.spec
  • JSON file: defined by /usr/lib/docker/plugins/infinit.json
  • Interfaces: Create, Remove, Mount, Path, Unmount, Get, List, and Capabilities

Example:

$ docker volume create --driver nas -o diskid="" -o host="10.46.225.247" -o path="/nas1" -o mode="" --name nas1

Docker volume drivers can be used to manage data volumes in single-node container environments or on the Swarm platform. Currently, Docker volume drivers are less used because Kubernetes has become increasingly popular. For more information about Docker volume drivers, see https://docs.docker.com/engine/extend/plugins_volume/

Kubernetes Storage Volumes

1. Basic Concepts

(1) Data Volumes

  • A data volume has the same lifecycle as the pod where it resides. When the pod is deleted, the data volume is removed along with it (although the underlying data is not necessarily deleted).
  • Storage details are defined in an orchestration template and are perceived during the application orchestration process.
  • You can define multiple volumes of the same or different storage classes for a load, also known as a pod.
  • Each container of a pod can reference one or more volumes. Different containers can share the same volume.

Common types of Kubernetes volumes include:

  • Local storage: This includes HostPath and emptyDir. These storage volumes store data on specific nodes of the cluster and do not drift with applications. The stored data is unavailable when the nodes are down.
  • Network storage: This includes Ceph, Glusterfs, network file system (NFS), and Internet Small Computer System Interface (iSCSI). These storage volumes store data through remote storage services. When you use these storage volumes, you need to mount storage services locally.
  • Secret and ConfigMap: These storage volumes store the object information of a cluster and do not belong to any nodes. Object data is mounted as volumes to nodes for use by applications.
  • Container Storage Interface (CSI) and FlexVolume: These are two extensions of data volumes and can be viewed as a type of abstract data volume. Each extension can be divided into different storage classes.
  • Persistent Volume Claim (PVC): This is a mechanism for defining data volumes. A PVC abstracts a data volume into an object that is independent of any pod. The storage information defined in, or associated with, this object is used when the storage volume is mounted for a Kubernetes workload.

Examples of volume templates:

volumes:
- name: hostpath
  hostPath:
    path: /data
    type: Directory
---
volumes:
- name: disk-ssd
  persistentVolumeClaim:
    claimName: disk-ssd-web-0
- name: default-token-krggw
  secret:
    defaultMode: 420
    secretName: default-token-krggw
---
volumes:
- name: "oss1"
  flexVolume:
    driver: "alicloud/oss"
    options:
      bucket: "docker"
      url: "oss-cn-hangzhou.aliyuncs.com"

(2) PVC and PV

  • Kubernetes storage volumes include Persistent Volume Claim (PVC), Persistent Volume (PV), and StorageClass (SC) objects. These objects are independent of application loads, also known as pods, and are associated with them through orchestration templates.
  • Each Kubernetes storage volume has its own lifecycle, which is independent of the pod lifecycle.

PVCs are a type of abstract storage volume in Kubernetes and represent the data volume of a specific storage class. PVCs are designed to separate storage from application orchestration. A PVC object abstracts storage details and implements storage volume orchestration. This makes storage volume objects independent of application orchestration in Kubernetes and decouples applications from storage at the orchestration layer.

PVs are a specific type of storage volume in Kubernetes. A PV object defines a specific storage class and a set of volume parameters. All information about the target storage service is stored in a PV object. Kubernetes references the PV-stored information for mounting.

The following figure shows the relationships among loads, PVC objects, and PV objects.

A PV object alone can already separate storage from application orchestration and be used to mount data volumes. So what is the purpose of combining PVC and PV objects? Through their combined use, Kubernetes implements a second layer of abstraction over storage volumes. A PV object describes a specific storage class by defining storage details. Users, however, do not want to study underlying storage details when they consume storage at the application layer, so defining specific storage services at the application orchestration layer is not user-friendly. To solve this problem, Kubernetes abstracts storage a second time: it extracts only the parameters that users care about and uses PVC objects to represent the underlying PV objects. PVC and PV therefore have different focuses. A PVC object focuses on users' storage needs and provides a unified way to declare storage. A PV object focuses on storage details, allowing administrators to define specific storage classes and storage mount parameters.

Specifically, the application layer declares a storage need (a PVC), and Kubernetes selects the PV object that best fits this PVC object and binds them together. PVCs are storage objects used by applications and belong to the application domain: a PVC object resides in the same namespace as the application. PVs are storage objects that belong to the storage domain and are not scoped to a namespace.

PVC and PV objects have the following attributes:

  • A PVC object is always paired with a PV object. A PVC object must be bound to a PV object before it can be consumed by an application or a pod.
  • One PVC object is bound to only one PV object. One PV object cannot be bound to multiple PVC objects, and one PVC object cannot be bound to multiple PV objects.
  • PVCs are storage abstractions at the application layer and belong to a specific namespace.
  • PVs are storage abstractions at the storage layer. They belong to the cluster rather than to a namespace. PV objects are managed by the storage O&M personnel.
  • Pods consume PVC objects and PVC objects consume PV objects. A PV object defines a specific storage medium.

(3) PVC Definition

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-ssd-web-0
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: alicloud-disk-available
  volumeMode: Filesystem

The PVC-defined storage interfaces are related to the storage access mode, resource capacity, and volume mode. The main parameters are described as follows:

“accessModes” defines the mode of access to storage volumes. The options include ReadWriteOnce, ReadWriteMany, and ReadOnlyMany.

  • “ReadWriteOnce” specifies that a PVC object can be consumed by one pod at a time through read and write operations.
  • “ReadWriteMany” specifies that a PVC object can be consumed by multiple pods simultaneously through read and write operations.
  • “ReadOnlyMany” specifies that a PVC object can be consumed by multiple pods simultaneously in read-only mode.

Note: The preceding access modes are only declared at the orchestration layer. Whether stored files are readable and writable is determined by specific storage plug-ins.

“storage” defines the storage capacity that the specified PVC object is expected to provide. The defined data size is only declared at the orchestration layer. The actual storage capacity is determined by the type of the underlying storage service.

“volumeMode” defines how storage volumes are mounted. The options include Filesystem and Block.

“Filesystem” specifies that data volumes are mounted as file systems for use by applications.

“Block” specifies that data volumes are mounted as raw block devices for use by applications.
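As an illustration of Block mode, a pod consumes such a volume through volumeDevices instead of volumeMounts. This is only a sketch: the pod name, device path, and the PVC name block-pvc (assumed to be created with volumeMode: Block) are hypothetical.

apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
  - name: app
    image: nginx:1.8
    volumeDevices:              # Block mode: the volume is exposed as a raw device
    - name: data
      devicePath: /dev/xvda     # device path inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc      # a PVC defined with volumeMode: Block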

(4) PV Definition

apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    failure-domain.beta.kubernetes.io/region: cn-shenzhen
    failure-domain.beta.kubernetes.io/zone: cn-shenzhen-e
  name: d-wz9g2j5qbo37r2lamkg4
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 30Gi
  flexVolume:
    driver: alicloud/disk
    fsType: ext4
    options:
      VolumeId: d-wz9g2j5qbo37r2lamkg4
  persistentVolumeReclaimPolicy: Delete
  storageClassName: alicloud-disk-available
  volumeMode: Filesystem

  • “accessModes” defines the mode of accessing storage volumes. The options include ReadWriteOnce, ReadWriteMany, and ReadOnlyMany. These options are the same as for PVC.
  • “capacity” defines the storage volume capacity.
  • “persistentVolumeReclaimPolicy” defines the reclaim policy, that is, how to process a PV object when the bound PVC object is deleted. The options include Delete and Retain. This parameter will be described in the “Dynamic Data Volumes” section.
  • “storageClassName” defines the storage class name used by a storage volume. This parameter will be described in the “Dynamic Data Volumes” section.
  • “volumeMode” has the same meaning as the “volumeMode” parameter for PVC.
  • “flexVolume” defines the volume plug-in type (an abstract storage type). Its sub-configuration items define the specific storage class and a set of storage parameters.

(5) PVC-PV Binding

  • VolumeMode: The PV object to be consumed must be in the same volume mode as the PVC object.
  • AccessMode: The PV object to be consumed must be in the same access mode as the PVC object.
  • StorageClassName: If this parameter is defined for a PVC object, only a PV object that has the corresponding parameters defined can be bound to this PVC object.
  • LabelSelector: The appropriate PV object is selected from a PV list through label matching.
  • storage: The PV object to be consumed must have a storage capacity not less than that of the PVC object.

Only a PV object that meets the preceding requirements can be bound to the PVC object.

If multiple PV objects meet requirements, the most appropriate PV object is selected for binding. Generally, the PV object with the minimum capacity is selected. If multiple PV objects have the same minimum capacity, one of them is randomly selected.

If no PV objects meet the preceding requirements, the PVC object enters the pending state until a conforming PV object appears.
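The binding status can be checked with standard kubectl commands, for example with the PVC from the earlier template:

$ kubectl get pvc disk-ssd-web-0          # STATUS is Pending until a matching PV is bound
$ kubectl describe pvc disk-ssd-web-0     # the Events section shows why a PVC is still Pending
$ kubectl get pv                          # lists the cluster-scoped PV objects and their claims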

2. Static and Dynamic Storage Volumes

Storage volumes are divided into dynamic storage volumes and static storage volumes based on the PV creation method.

  • Static storage volumes are PV objects created by administrators.
  • Dynamic storage volumes are PV objects created by the Provisioner plug-in.

(1) Static Storage Volumes

With static storage volumes, the cluster administrator creates PV objects in advance, and the PVC objects created by users are bound to these pre-created PV objects.

(2) Dynamic Storage Volumes

With dynamic storage volumes, PV objects are not created in advance. Instead, the Provisioner plug-in automatically creates the underlying data volume and the corresponding PV object when a PVC object requires one.

Dynamic and static storage volumes are compared as follows:

  • Both dynamic and static storage volumes follow the same consumption chain (a pod consumes a PVC object, which is bound to a PV object) and are defined by the same object templates.
  • Dynamic storage volumes are PV objects automatically created by plug-ins, whereas static storage volumes are PV objects manually created by cluster administrators.

Dynamic storage volumes provide the following advantages:

  • Dynamic volumes allow Kubernetes to implement automatic PV lifecycle management. PV objects are created and deleted by the Provisioner plug-in.
  • PV objects can be created automatically, which simplifies configuration and reduces the workload of system administrators.
  • Dynamic volumes maintain consistency between the PVC-required storage capacity and the PV capacity that is configured by the Provisioner plug-in. This optimizes storage capacity planning.

(3) Implementation Process for Dynamic Volumes

A storage class can be viewed as the template used to create a PV storage volume. When a PVC object triggers the automatic PV creation process, a PV object is created by using the content of a storage class. The content includes the name of the target Provisioner plug-in, a set of parameters used for PV creation, and the reclaim mode.

A storage class template is defined as follows:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-topology
parameters:
  type: cloud_ssd
provisioner: diskplugin.csi.alibabacloud.com
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

  • “provisioner” specifies the name of a registration plug-in, which is used to create a PV object. A storage class defines only one Provisioner plug-in.
  • “parameters” specifies a set of parameters used to create a data volume. In this example, an SSD-type cloud disk is created.
  • “reclaimPolicy” specifies the value of the persistentVolumeReclaimPolicy field used to create a PV object. The options include Delete and Retain. “Delete” specifies that a dynamically created PV object is automatically released when the bound PVC object is released. “Retain” indicates that a PV object is dynamically created, but must be released by the administrator.
  • “allowVolumeExpansion” specifies whether a PV object created from this storage class supports dynamic expansion (resizing). The default value is “false”. This parameter only enables or disables the feature at the orchestration layer; whether expansion actually works is determined by the underlying storage plug-in.
  • “volumeBindingMode” specifies the time when PV objects are dynamically created. The options include Immediate (immediate creation) and WaitForFirstConsumer (delayed creation).
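For illustration, a PVC that would trigger dynamic provisioning through the preceding storage class might look as follows; the PVC name and requested capacity are only examples.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-ssd-dynamic
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: alicloud-disk-topology   # references the storage class defined above
  resources:
    requests:
      storage: 20Gi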

When you create a PVC declaration, Kubernetes finds a suitable PV object in the cluster to be bound to the created PVC object. If no suitable PV object exists, the following process is triggered:

  • Volume Provisioner watches for this PVC object. If storageClassName is defined for the PVC object and the corresponding storage class specifies this Provisioner plug-in, the Provisioner plug-in triggers the PV creation process.
  • The Provisioner plug-in determines the volume parameters from the PVC object (such as Size, VolumeMode, and AccessModes) and from the storage class (such as ReclaimPolicy and Parameters).
  • The Provisioner plug-in creates a data volume in the storage medium, for example by calling the storage API. After the data volume is created, the Provisioner plug-in creates the corresponding PV object.
  • The created PV object is bound to the PVC object so that pods can be started.

(4) Delayed Binding of Dynamic Data Volumes

In a multi-zone cluster, creating a data volume immediately after the PVC object is created can cause the following problems:

  • A data volume is created in Zone A, but Zone A has no available node resources. As a result, the created volume cannot be mounted to a pod upon startup.
  • When administrators plan PVC and PV objects, they cannot determine in which zones they can create multiple PV objects for backup.

The storage class template provides the volumeBindingMode field to solve the preceding problems. When this field is set to WaitForFirstConsumer, the Provisioner plug-in does not create a data volume as soon as the PVC object is created. Instead, it waits until the PVC object is consumed by a pod.

The detailed process is as follows:

  • When the Provisioner plug-in detects the PVC object in the pending state, it delays data volume creation instead of provisioning immediately.
  • If a pod consumes the PVC object and the scheduler determines that the PVC object enables delayed binding, then the PV scheduling process continues. The scheduler patches the scheduling result to the metadata of the PVC object. Storage scheduling will be described in a later article.
  • When the Provisioner plug-in finds that scheduling information has been written to the PVC object, it retrieves location information (such as the zone and node) from that information and then triggers the data volume creation process.
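The delayed binding behavior can be observed from the command line, roughly as follows; the PVC name and manifest file are only illustrative.

$ kubectl get pvc disk-ssd-dynamic        # stays Pending while waiting for the first consumer
$ kubectl apply -f pod-using-pvc.yaml     # schedule a pod that mounts the PVC
$ kubectl get pvc disk-ssd-dynamic        # becomes Bound after the Provisioner creates the PV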

The delayed binding feature is used to schedule application loads to ensure that sufficient resources are available for use by pods before dynamic volumes are created. This also ensures that data volumes are created in zones with available resources and improves the accuracy of storage planning.

We recommend that you use the delayed binding feature when you create dynamic volumes in a multi-zone cluster. The preceding configuration process is supported by Alibaba Cloud Container Service for Kubernetes (ACK) clusters.

3. Example

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nas-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  selector:
    matchLabels:
      alicloud-pvname: nas-csi-pv
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-csi-pv
  labels:
    alicloud-pvname: nas-csi-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  flexVolume:
    driver: "alicloud/nas"
    options:
      server: "***-42ad.cn-shenzhen.extreme.nas.aliyuncs.com"
      path: "/share/nas"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-nas
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx1
        image: nginx:1.8
      - name: nginx2
        image: nginx:1.7.9
        volumeMounts:
        - name: nas-pvc
          mountPath: "/data"
      volumes:
      - name: nas-pvc
        persistentVolumeClaim:
          claimName: nas-pvc

Template explanation:

  • The preceding application is an NGINX service that is orchestrated in Deployment mode. Each pod includes two containers: nginx1 and nginx2.
  • The template defines the Volumes field to mount data volumes for the application. The data volumes are defined as PVC objects.
  • Within the application, the data volume nas-pvc is mounted to the /data directory of the nginx2 container. No data volume is mounted to the nginx1 container.
  • The PVC object nas-pvc requests a storage volume with at least 50 GiB of capacity and the ReadWriteOnce access mode, and uses a label selector to match a PV object.
  • The PV object nas-csi-pv is defined as a 50 GiB storage volume with the ReadWriteOnce access mode, the Retain reclaim policy, and the Flexvolume type. This PV object carries the matching label.

According to the PVC-PV binding logic, this PV object meets the PVC consumption requirements. Therefore, the PVC object is bound to the PV object and mounted to a pod.
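To try the template, you can apply it and verify the result with standard kubectl commands; the file name nas-example.yaml and the pod name placeholder are only examples.

$ kubectl apply -f nas-example.yaml               # creates the PVC, PV, and Deployment above
$ kubectl get pvc nas-pvc                         # STATUS should show Bound, with VOLUME nas-csi-pv
$ kubectl get pods -l app=nginx                   # find a pod created by the Deployment
$ kubectl exec <pod-name> -c nginx2 -- df -h /data   # the NAS share is mounted at /data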

Summary

This article introduced container storage at two levels: Docker storage on a single node (the container read/write layer, storage drivers, data volumes, and volume plug-ins) and Kubernetes storage volumes in a cluster (volume types, PVC and PV objects, and static and dynamic provisioning).
