Cloud-Native Storage: The Cornerstone of Cloud-Native Applications
By Junbao, Alibaba Cloud Technical Expert
It is impossible to ignore the significance of storage services. They are an important part of the computer system and the foundation on which we run all applications, maintain application state, and persist data. As storage services have evolved, each new business model or technology has put forward new requirements for storage architecture, performance, availability, and stability. With the current ascendance of cloud-native technology, we must consider what types of storage services will be needed to support cloud-native applications.
This is the first in a series of articles that explore the concepts, characteristics, requirements, principles, and usage of cloud-native storage services. These articles will also introduce case studies and discuss the new opportunities and challenges posed by cloud-native storage technology.
“There is no such thing as a ‘stateless’ architecture” — Jonas Boner
This article focuses on the basic concepts of cloud-native storage and common storage solutions.
Basic Concepts of Cloud-Native Storage
To understand cloud-native storage, we must first familiarize ourselves with cloud-native technology. According to the definition given by the Cloud Native Computing Foundation (CNCF):
Cloud-native technology empowers organizations to build and run scalable applications in emerging dynamic environments, such as the public, private, and hybrid clouds. Examples of cloud-native technology include containers, service meshes, microservices, immutable infrastructures, and declarative APIs.
These techniques enable us to build loosely coupled systems that are highly fault-tolerant and easy to manage and observe. When used together with robust automation solutions, cloud-native technology allows engineers to make high-impact changes frequently, predictably, and effortlessly.
In a nutshell, there is no clear dividing line between cloud-native applications and traditional applications. Cloud-native technology describes a technological tendency. Basically, the more of the features below an application offers, the more cloud-native it is:
- Application containerization
- Service meshing
- Declarative APIs
- Elastically scalable operation
- Automated DevOps
- Fault tolerance and auto-recovery
- Portable and platform-independent
A cloud-native application is a collection of application features and capabilities. These features and capabilities, once realized, can greatly optimize the application’s core capabilities, such as availability, stability, scalability, and performance. The pursuit of these capabilities is the current technological trend. As part of this process, cloud-native applications are transforming various application fields while profoundly changing all aspects of application services. As one of the most basic conditions necessary to run an application, storage is an area where many new requirements have been raised as services have become cloud native.
The concept of cloud-native storage originated from cloud-native applications. Cloud-native applications require storage with matching cloud-native capabilities, or in other words, storage with cloud-native tendencies.
Characteristics of Cloud-native Storage
The availability of a storage system represents users’ ability to access data in the event of system failure caused by storage media, transmission, controllers, or other system components. Availability defines how data can still be accessed when a system fails and how access can be rerouted through other accessible nodes when some nodes become unavailable.
It also defines the recovery time objective (RTO) in the event of failure, which is the amount of time between the occurrence of a failure and the recovery of services. Availability is typically expressed as a percentage that represents an application’s available time as a proportion of its total running time (for example, 99.9%). It is derived from measurements such as mean time to failure (MTTF) and mean time to repair (MTTR).
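The relationship between the availability percentage and the two time measurements can be made concrete with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only; the input figures are hypothetical and do not describe any particular storage service.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is usable."""
    return mttf_hours / (mttf_hours + mttr_hours)


def yearly_downtime_hours(avail: float, hours_per_year: float = 365 * 24) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - avail) * hours_per_year


# Hypothetical figures: one failure every 999 hours on average, 1 hour to repair.
a = availability(mttf_hours=999, mttr_hours=1)
print(f"availability: {a:.3%}")
print(f"downtime per year: {yearly_downtime_hours(a):.2f} hours")
```

Under these assumed inputs, an availability of 99.9% corresponds to a little under nine hours of downtime per year, which is why each additional "nine" matters for critical data.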
Storage scalability primarily represents the ability to:
- Increase the number of clients that can access the storage system (for example, the number of clients that can concurrently mount a NAS volume);
- Scale up the throughput and I/O performance of individual APIs; and
- Expand the capacity of individual storage service instances, such as the scaling of a cloud disk.
Storage performance is usually measured by two metrics:
- The maximum number of storage operations per second, or input/output operations per second (IOPS); and
- The maximum storage read/write throughput per second, or throughput.
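The two metrics are linked by the I/O size: throughput is roughly IOPS multiplied by the size of each operation. The following sketch uses hypothetical workload numbers to show why a small-block workload can be IOPS-bound while a large-block workload is throughput-bound.

```python
def throughput_mib_per_s(iops: int, io_size_kib: int) -> float:
    """Throughput implied by an IOPS figure at a fixed I/O size."""
    return iops * io_size_kib / 1024


# Hypothetical workloads, not measurements of any real service.
small_io = throughput_mib_per_s(iops=10_000, io_size_kib=4)    # many small operations
large_io = throughput_mib_per_s(iops=200, io_size_kib=1024)    # few large operations

print(f"10,000 IOPS at 4 KiB -> {small_io} MiB/s")
print(f"200 IOPS at 1 MiB    -> {large_io} MiB/s")
```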
Cloud-native applications are widely used in scenarios such as big data analysis and AI. These high-throughput, large-I/O scenarios place demanding requirements on storage. In addition, features of cloud-native applications such as rapid resizing and extreme scaling test the ability of storage services to cope with sudden traffic peaks.
For storage services, consistency refers to the ability to access new data after it is submitted or existing data is updated. Based on the latency of data consistency, storage consistency can be divided into two types: eventual consistency and strong consistency.
Services vary in their sensitivity to storage consistency. Applications such as databases, which have strict requirements for the accuracy and timeliness of underlying data, require strong consistency.
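The difference between the two consistency types can be illustrated with a toy store that has one primary copy and one replica. This is a simplified sketch under stated assumptions, not a model of any real storage product: a synchronous write updates the replica before returning (strong consistency), while an asynchronous write leaves the replica stale until replication runs (eventual consistency). The class and method names are hypothetical.

```python
class ReplicatedStore:
    """Toy primary/replica store used only to illustrate consistency."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes not yet applied to the replica

    def write(self, key, value, sync: bool) -> None:
        self.primary[key] = value
        if sync:
            self.replica[key] = value           # visible everywhere on return
        else:
            self._pending.append((key, value))  # replica catches up later

    def replicate(self) -> None:
        """Apply all pending writes to the replica."""
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

    def read_from_replica(self, key):
        return self.replica.get(key)


store = ReplicatedStore()
store.write("balance", 100, sync=False)
stale = store.read_from_replica("balance")   # replica has not seen the write yet
store.replicate()
fresh = store.read_from_replica("balance")   # replica has converged
print(stale, fresh)
```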
Multiple factors can influence data persistence:
- The level of system redundancy;
- The durability of storage media (such as SSDs or HDDs); and
- The ability to detect data corruption and use data protection features to rebuild or recover corrupted data.
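The last factor, detecting corruption and rebuilding from redundant data, can be sketched with a checksum and a set of replicas. The example below is a minimal illustration that assumes full replicas and a CRC32 checksum recorded at write time; real systems typically use stronger checksums and more elaborate repair logic.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC32 checksum recorded when the data is written."""
    return zlib.crc32(data)


def read_with_repair(replicas: list, expected: int) -> bytes:
    """Return an intact copy and overwrite corrupted replicas with it."""
    good = next((r for r in replicas if checksum(r) == expected), None)
    if good is None:
        raise IOError("all replicas are corrupted")
    for i, blob in enumerate(replicas):
        if checksum(blob) != expected:
            replicas[i] = good  # rebuild from the healthy copy
    return good


original = b"cloud-native storage"
expected = checksum(original)
replicas = [original, b"cloud-native st0rage", original]  # second copy is corrupted
data = read_with_repair(replicas, expected)
print(data == original, all(r == original for r in replicas))
```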
Data Access Interface
In cloud-native application systems, applications can access storage services in various ways. Storage access can be divided into two modes based on interface type: data volume and API.
Data volume: Storage services are mapped to block devices or a file system for direct access by applications, which consume them in much the same way as they read and write files in a local directory of the operating system. For example, you can mount block storage or file storage to the local host and access the data volume just like local files.
API: Some types of storage cannot be mounted and accessed as data volumes, but instead, have to be accessed through API operations. For example, database storage, KV storage, and Object Storage Service (OSS) perform read/write operations through APIs.
Note: OSS generally provides file read/write capabilities externally through RESTful APIs. However, it can also be mounted as a user-space file system and then used like block or file storage mounted as a data volume.
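The two access modes look quite different from the application's point of view. The sketch below contrasts them under simplifying assumptions: a temporary directory stands in for a mounted data volume, and `InMemoryObjectStore` is a hypothetical stand-in for an object storage API, not a real SDK.

```python
import tempfile
from pathlib import Path


class InMemoryObjectStore:
    """Hypothetical stand-in for an object storage API (not a real SDK)."""

    def __init__(self):
        self._objects = {}

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        self._objects[(bucket, key)] = body

    def get_object(self, bucket: str, key: str) -> bytes:
        return self._objects[(bucket, key)]


def roundtrip_via_volume(mount_point: Path, name: str, data: bytes) -> bytes:
    """Data volume mode: plain file I/O under a mounted directory."""
    target = mount_point / name
    target.write_bytes(data)
    return target.read_bytes()


def roundtrip_via_api(store: InMemoryObjectStore, data: bytes) -> bytes:
    """API mode: explicit put/get calls addressed by bucket and key."""
    store.put_object("demo-bucket", "report.csv", data)
    return store.get_object("demo-bucket", "report.csv")


payload = b"id,value\n1,42\n"
with tempfile.TemporaryDirectory() as mount_dir:  # stands in for a mounted volume
    volume_result = roundtrip_via_volume(Path(mount_dir), "report.csv", payload)
api_result = roundtrip_via_api(InMemoryObjectStore(), payload)
print(volume_result == api_result == payload)
```

In volume mode the application needs no storage-specific code, whereas in API mode it depends on the client interface of the storage service.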
The following table lists the advantages and disadvantages of each storage interface:
Tiered Cloud-native Storage
(1) Orchestration and Operating System Layer
This layer defines the interfaces for external access to the stored data, namely, the form in which storage is presented when applications access data. As described in the preceding section, access methods can be divided into data volume access and API access. This layer is the key to storage orchestration when developing container services. The agility, operability, and scalability of cloud-native storage are implemented on this layer.
(2) Storage Topology Layer
This layer defines the topology and architectural design of the storage system. It indicates how different parts of the system, such as storage devices, computing nodes, and data, are related and connected. The construction of the topology affects many properties of the storage system, so much thought must be put into its design.
The storage topology can be centralized, distributed, or hyper-converged.
(3) Data Protection Layer
This layer defines how to protect data through redundancy, which is what allows a storage system to recover data in the event of a failure. Generally, users can choose from several data protection solutions:
- Redundant Array of Independent Disks (RAID): This technology distributes data across multiple disks with built-in redundancy.
- Erasure coding: Data is divided into multiple segments that are encoded and stored together with redundant coded segments to ensure data recoverability.
- Replica: The data set is replicated across multiple servers so that multiple complete replicas of the data set are available.
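The trade-off between these approaches can be sketched with the simplest form of parity. The example below uses a single XOR parity block, the idea behind parity-based RAID levels and the most basic erasure code, to rebuild one lost block. It then compares the raw capacity needed per unit of usable data for replication and for a hypothetical 4+2 erasure-coding layout. The block contents and layouts are illustrative assumptions.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def raw_capacity_factor(data_units: int, redundancy_units: int) -> float:
    """Raw capacity consumed per unit of usable data."""
    return (data_units + redundancy_units) / data_units


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second block and rebuilding it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

three_replicas = raw_capacity_factor(data_units=1, redundancy_units=2)
erasure_4_plus_2 = raw_capacity_factor(data_units=4, redundancy_units=2)
print(three_replicas, erasure_4_plus_2)
```

Both layouts in this sketch survive the loss of two units, but the erasure-coded one uses half the raw capacity, at the cost of extra computation during writes and rebuilds.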
(4) Data Services
Data services supplement core storage capabilities with additional storage services, such as storage snapshots, data recovery, and data encryption.
The supplementary capabilities provided by data services are precisely what cloud-native storage requires. Cloud-native storage achieves agility, stability, and scalability by integrating a variety of data services.
(5) Physical Layer
This layer defines the actual physical hardware that stores the data. The choice of physical hardware affects the overall system performance and the continuity of stored data.
The cloud-native nature of the storage means it must be containerized. In container service scenarios, a management system or application orchestration system is usually required. The interaction between the orchestration system and the storage system establishes the correlation between workloads and stored data.
In the above diagram:
- “Load” indicates application instances that consume underlying storage resources.
- “Orchestration System” is a container orchestration system similar to Kubernetes that is used for application management and scheduling.
- “Control Plane Interfaces” are the standard interfaces through which an orchestration system schedules and operates the underlying storage resources, such as FlexVolume in Kubernetes and the Container Storage Interface (CSI).
- “Access Tools” are the third-party tools and frameworks that control plane interfaces depend on to operate and maintain storage resources.
- The “Storage System” comprises the control plane and the data plane. The control plane exposes interfaces externally to attach and detach storage resources. The data plane provides data storage services.
The orchestration system prepares the storage resources defined by the application load. It calls the control plane interfaces to drive the control plane of the storage system, which attaches storage to the application load and later detaches it. Once the storage is attached, the application load accesses the data plane of the storage system directly.
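The division of labor described here can be sketched as a toy model: the orchestration system only talks to the control plane (create, attach, detach), and the workload only talks to the data plane (read, write). All class and function names below are hypothetical and do not correspond to the Kubernetes or CSI APIs.

```python
class StorageSystem:
    """Toy storage system with a control plane and a data plane."""

    def __init__(self):
        self._data = {}      # volume id -> {key: value}
        self._attached = {}  # volume id -> node name

    # Control plane: called by the orchestration system.
    def create_volume(self, volume_id):
        self._data.setdefault(volume_id, {})

    def attach(self, volume_id, node):
        if volume_id not in self._data:
            raise KeyError(f"unknown volume {volume_id}")
        self._attached[volume_id] = node

    def detach(self, volume_id):
        self._attached.pop(volume_id, None)

    # Data plane: called directly by the workload.
    def write(self, volume_id, node, key, value):
        self._require_attached(volume_id, node)
        self._data[volume_id][key] = value

    def read(self, volume_id, node, key):
        self._require_attached(volume_id, node)
        return self._data[volume_id][key]

    def _require_attached(self, volume_id, node):
        if self._attached.get(volume_id) != node:
            raise PermissionError(f"{volume_id} is not attached to {node}")


def start_workload(storage, volume_id, node):
    """Orchestration step: prepare the volume, then attach it to the node."""
    storage.create_volume(volume_id)
    storage.attach(volume_id, node)


storage = StorageSystem()
start_workload(storage, "vol-1", "node-a")
storage.write("vol-1", "node-a", "greeting", "hello")
print(storage.read("vol-1", "node-a", "greeting"))
storage.detach("vol-1")
```

After `detach`, data plane calls from the node are rejected in this model, which mirrors the idea that the control plane decides which workloads may reach the data.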
Common Cloud-Native Storage Solutions
Public Cloud Storage
Every public cloud service provider offers a variety of cloud resources, including various cloud storage services. Take Alibaba Cloud for example. It provides basically any type of storage required by business applications, including object storage, block storage, file storage, and databases, just to name a few. Public cloud storage has the advantages of scale. Public cloud providers operate on a large scale, allowing them to keep prices low while making gigantic investments in R&D and O&M. In addition, public cloud storage can easily meet the needs of businesses for stability, performance, and scalability.
With the development of cloud-native technology, public cloud providers are competing to transform and adapt their cloud services for the cloud-native environment and offer more agile and efficient services to meet the needs of cloud-native applications. Alibaba Cloud’s storage services are also optimized in many ways for compatibility with cloud-native applications. The CSI storage driver implemented by Alibaba Cloud’s container service team seamlessly connects data interfaces across cloud-native applications and the storage services. The underlying storage is imperceptible to users when they consume the storage resources, which allows them to focus on business development.
Advantages of Public Cloud Storage
- High reliability: Most cloud vendors provide services with high stability and outstanding data availability. For example, Alibaba Cloud’s Elastic Block Store (EBS) offers 99.99999999% data reliability, providing a strong and fundamental guarantee of data security.
- High performance: Public clouds offer various levels of storage performance to suit different services, allowing them to meet the storage performance requirements of almost all applications. Alibaba Cloud’s EBS is capable of millions of IOPS, similar to the performance of access to local disks. Apsara File Storage NAS provides a maximum throughput of dozens of Gbit/s, enabling it to meet the rigorous performance requirements of data sharing scenarios. CPFS, a high-performance parallel file system, can provide a throughput of up to 1 Tbit/s, enough to satisfy the storage requirements of extremely demanding high-performance computing.
- High scalability: Public cloud storage services generally support capacity scaling, which allows you to expand capacity dynamically when an application requires more storage, without affecting the application.
- Robust security: Cloud storage services provide data protection mechanisms that use encryption technology, such as KMS or AES, to encrypt stored data. They also encrypt the link between clients and services, so data is protected in transit as well.
- Mature cloud-native storage interfaces: The cloud-native storage interfaces are compatible with all types of storage. For example, the CSI driver provided by Alibaba Cloud’s Container Service supports cloud disks, OSS, NAS, local disks, memory, LVM, and other storage types, allowing applications to seamlessly access different storage services.
- Zero maintenance: Compared with user-created storage services, public cloud storage solutions save users the trouble of performing O&M.
Disadvantages of Public Cloud Storage
- Poor customization: As public cloud storage solutions need to satisfy the needs of all user scenarios, their capabilities are designed to meet general needs, rather than the personalized needs of specific users.
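From the application side, consuming storage through a cloud-native interface usually comes down to declaring a claim. The sketch below builds a Kubernetes PersistentVolumeClaim manifest as a plain Python dictionary and prints it as JSON. The claim name and the StorageClass name `example-cloud-disk` are hypothetical; which StorageClasses exist depends on the cluster and the installed CSI driver.

```python
import json


def build_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# "example-cloud-disk" is a hypothetical StorageClass name.
pvc = build_pvc("app-data", "example-cloud-disk", "20Gi")

# Requesting more capacity is a change to a single field of the claim.
expanded = build_pvc("app-data", "example-cloud-disk", "40Gi")

print(json.dumps(pvc, indent=2))
```

The application names only a StorageClass and a size, so the underlying storage type stays hidden behind the driver. Whether the larger request in `expanded` is honored depends on the StorageClass allowing volume expansion.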
Commercial Cloud Storage
In many private cloud environments, business users purchase commercial storage services to achieve high data reliability. These solutions provide users with highly available, highly efficient, and convenient storage services, and guarantee O&M services and post-production support. Private cloud storage providers, as they become gradually aware of the popularity of cloud-native applications, are providing users with comprehensive and mature cloud-native storage interface implementations.
Advantages of Commercial Cloud Storage
- Robust security: Deployment in a private cloud physically isolates data.
- High reliability and high performance: Many cloud storage providers have dedicated years of work to storage technology and possess outstanding technical and O&M capabilities. Their commercial storage services can meet the performance and reliability requirements of most applications.
- Cloud-native storage interfaces: The open-source projects launched by various storage service providers indicate that they already support or are starting to support cloud-native applications.
Disadvantages of Commercial Cloud Storage
- High cost: Most commercial storage services are very expensive.
- Compatibility of cloud-native storage interfaces: Commercial cloud-native storage interfaces are usually specific to one storage type. Most users rely on several kinds of storage, and when they use different storage services at the same time, unified storage access is very difficult to achieve.
User-created Storage Service
Many companies choose to build their own storage services for business data with low service level requirements. Business users can choose among currently available open-source storage solutions based on their business needs.
- File storage: Available solutions include CephFS, GlusterFS, and NFS. The technical maturity of CephFS and GlusterFS requires further verification, and their capabilities are insufficient for high reliability and high performance scenarios. Although NFS is mature, its performance in user-created clusters cannot meet the needs of high-performance applications.
- Block storage: Common block storage solutions, such as RBD and SAN, are relatively mature technologies and used by many companies in their own services. However, they are quite complex and require a dedicated team to support and maintain them.
Advantages of User-created Storage Services
- High flexibility that can be matched to business needs: Users can choose among many open-source solutions and use the one most suitable for their business needs. Then, they can conduct secondary development on the native code to optimize the solution for their specific business scenarios.
- Robust security: If a user-created storage service is used within a company, it can provide secure physical isolation.
- Cloud-native storage interfaces: Almost all common open-source storage solutions have cloud-native storage interface implementations available from the developer community. Users can further develop and optimize these implementations.
Disadvantages of User-created Storage Services
- Weak performance: Most open-source storage solutions have weak native performance, although they can be optimized through architecture design, physical hardware upgrades, and secondary development.
- Poor reliability: Open-source storage solutions are not comparable to commercial storage in terms of reliability, so they are more frequently used in data storage scenarios with low service level requirements.
- A myriad of cloud-native storage plug-ins: Currently, there are many versions of open-source cloud-native storage drivers available online, with widely varying quality. Some projects have bugs and have been left unmaintained for a long time. Therefore, users must take time to identify appropriate plug-ins and fine-tune them.
- Professional team support: Users are themselves responsible for operating and maintaining the user-created services. When they use less mature open-source solutions, they must create a team of highly skilled professionals to operate, maintain, and develop the storage system.
Local Storage
Some businesses do not need highly available distributed storage services. Instead, they prefer local storage solutions due to their high performance.
Database services: If users need to achieve high storage I/O performance and low access latency, common block storage services cannot effectively satisfy their needs. In addition, if their applications are designed for high data availability and do not need to retain multiple replicas at the underlying layer, the multi-replica design of distributed storage is a waste of resources.
Storage as a cache: Some users expect their applications to save unimportant data, which can be discarded after programs are executed. This also requires high storage performance. Essentially storage is being used as a cache. The high availability of cloud disks does not make much sense for such services. In addition, cloud disks don’t have advantages over local storage in terms of performance and cost.
Therefore, although local disk storage is much weaker than distributed block storage in many key aspects, it still maintains a competitive edge in specific scenarios. Alibaba Cloud provides a local disk storage solution based on NVMe. Its superior performance and lower pricing make it popular among users for specific scenarios.
Cloud-native applications can use the Alibaba Cloud CSI drivers to access local storage through multiple access methods, such as LVM volumes, raw device access to local disks, and local directory mapping. The drivers also support high-performance data access, quota operations, and IOPS configuration, among others.
Advantages of Local Storage
- High performance: This solution supports higher IOPS and throughput relative to distributed storage.
- Low price: Local disks can be directly provided through raw devices, which is a lower-cost solution than distributed storage with multiple replicas.
Disadvantages of Local Storage
- Poor data reliability: Data stored on local disks cannot be recovered after it is lost, so users must implement a high-availability data architecture on the application layer.
- Poor flexibility: Data cannot be migrated to other nodes as on a cloud disk.
Open-source Container Storage
With the development of cloud-native technology, the developer community has released some open-source cloud-native storage solutions.
Rook, the first CNCF storage project, is a cloud-native storage solution that integrates distributed storage systems such as Ceph and MinIO. It is designed for simple deployment and management, is deeply integrated with the container service ecosystem, and provides various features for adaptation to cloud-native applications. In terms of implementation, Rook can be seen as an operator that provides Ceph cluster management capabilities. It uses custom resource definitions (CRDs) to deploy and manage storage resources such as Ceph and MinIO.
Rook consists of the following components:
- Operator: This component is used to automatically launch the storage cluster and monitor the storage daemons to ensure the health of the storage cluster.
- Agent: The agent component runs on each storage node and deploys a CSI/FlexVolume plug-in for integration with the Kubernetes volume control framework. The agent processes all storage operations, including mounting storage devices, loading storage volumes, and formatting file systems.
- Discover: This component detects storage devices attached to a storage node.
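The operator's job of launching and monitoring storage daemons follows the general reconcile pattern used by Kubernetes operators: compare the desired state with the observed state and derive corrective actions. The sketch below illustrates that pattern only. It is not Rook's code, and the daemon counts are hypothetical.

```python
def reconcile(desired: dict, running: dict) -> list:
    """Return the actions needed to move `running` toward `desired`.

    Both arguments map a daemon name to a number of instances.
    """
    actions = []
    for daemon, want in desired.items():
        have = running.get(daemon, 0)
        if have < want:
            actions.append(("start", daemon, want - have))
        elif have > want:
            actions.append(("stop", daemon, have - want))
    return actions


# Hypothetical desired state for a small Ceph cluster.
desired = {"mon": 3, "osd": 3, "mgr": 1}
# Observed state: one OSD has crashed and the manager never started.
running = {"mon": 3, "osd": 2}

actions = reconcile(desired, running)
print(actions)
```

A real operator runs such a comparison continuously and applies the resulting actions through the orchestration system rather than returning them as a list.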
Rook deploys the Ceph storage service as a service in Kubernetes. The daemon processes, such as MON, OSD, and MGR, are deployed as pods in Kubernetes, while the core components of Rook perform O&M and management operations on the Ceph cluster.
Through Ceph, Rook externally provides comprehensive storage capabilities and supports object, block, and file storage services, allowing users to use multiple storage services with one system. The default implementation of the cloud-native storage interface in Rook uses the CSI/FlexVolume driver to connect application services to the underlying storage. As Rook was initially designed to serve the needs of the Kubernetes ecosystem, it is well adapted to containerized applications.
Official Rook documentation: https://rook.io/
OpenEBS is an open-source implementation that emulates the functions of block storage services such as AWS EBS and Alibaba Cloud disks. It is a container storage solution based on container attached storage (CAS). Like the applications it serves, the storage itself adopts a microservices model and is orchestrated through Kubernetes. In terms of architecture, the controller of each volume is a separate pod that shares a node with the application pod, and the volume data is managed through multiple pods.
The architecture comprises the data plane and control plane.
- The data plane provides data storage for applications.
- The control plane manages OpenEBS containers. This usually involves the functions of container orchestration software.
OpenEBS persistent volumes (PVs) are created based on the PVs in Kubernetes and implemented through iSCSI. Data is stored on nodes or in cloud storage. OpenEBS volumes are managed independently of the application life cycle, similar to the PVs in Kubernetes.
OpenEBS volumes provide persistent storage for containers, elasticity to cope with system failures, and faster access to storage, snapshots, and backups. They also provide a mechanism to monitor usage and execute QoS policies.
The OpenEBS control plane, Maya, implements hyper-converged OpenEBS. It hooks into the container orchestration system, for example the Kubernetes scheduling engine, and extends the storage functions that the orchestration system provides.
The OpenEBS control plane is also based on microservices and implements different features, such as storage management, monitoring, and container orchestration plug-ins, through different components.
For more information on OpenEBS, see: https://openebs.io/
Just as Rook implements the Ceph open-source storage system on the cloud-native orchestration platform (Kubernetes), GlusterFS also has a cloud-native implementation: Heketi. Heketi provides a RESTful API for managing the life cycles of Gluster volumes. With Heketi, Kubernetes can dynamically provision Gluster volumes with any of the supported durability types. Heketi automatically determines the location of each brick in the cluster and ensures that a brick and its copies are placed in different failure domains. Heketi also supports any number of Gluster storage clusters to provide network file storage for cloud services.
With Heketi, administrators no longer need to manage or configure bricks, disks, or storage pools. The Heketi service manages all the system hardware and allocates storage as needed. Any physical storage registered with Heketi must be provided as raw devices, which Heketi then manages by using LVM.
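The placement rule, that a brick and its copies must land in different failure domains, can be sketched as a simple selection over nodes labeled with a zone. This is an illustrative toy, not Heketi's actual placement algorithm, and the node and zone names are hypothetical.

```python
def place_replicas(nodes: dict, replica_count: int) -> list:
    """Pick `replica_count` nodes so that no two share a failure domain.

    `nodes` maps a node name to its failure domain (for example, a zone).
    """
    chosen = []
    used_domains = set()
    for node, domain in sorted(nodes.items()):
        if domain in used_domains:
            continue  # a copy already lives in this failure domain
        chosen.append(node)
        used_domains.add(domain)
        if len(chosen) == replica_count:
            return chosen
    raise ValueError("not enough distinct failure domains")


nodes = {
    "node-a": "zone-1",
    "node-b": "zone-1",
    "node-c": "zone-2",
    "node-d": "zone-3",
}
placement = place_replicas(nodes, replica_count=3)
print(placement)
```

With this constraint, the loss of a single zone removes at most one copy of any brick.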
For more details, see: https://github.com/heketi/heketi
Advantages of Open-source Container Storage
- The designs of the cloud-native storage solutions described in the preceding sections take into account the integration of storage with cloud-native orchestration systems and access to container data volumes.
- These solutions can be integrated with cloud-native applications to support quota configurations, QoS speed limiting, ACL control, snapshots, and backup as well as facilitate flexible and convenient use of storage resources.
- These solutions are open-source and have many active users in the developer community. Their abundant online resources and solutions make adoption simple and easy.
Disadvantages of Open-source Container Storage
- These solutions are less mature and are still mostly used in internal test environments or with applications with low service level requirements. They are rarely used to store critical application data.
- Poor performance: The preceding cloud-native storage solutions are outperformed by public cloud storage and commercial storage in terms of I/O performance, throughput, and latency, so they are rarely used in high-performance service scenarios.
- High subsequent maintenance costs: Although these solutions are easy to deploy and adopt, they are difficult to troubleshoot if anything goes wrong during operation. These projects are still at an early stage of development and not ready to serve in production environments. When using any of these solutions, users need to establish a strong technical team to ensure they can deal with any problems.
Current Situation and Challenges
Cloud-native application scenarios impose rigorous requirements for service agility and flexibility. Many expect fast container startup and flexible scheduling. To meet these requirements, volumes must be able to follow pods quickly as the pods are created, moved, and removed.
These requirements demand the following improvements:
- Improved efficiency in mounting and detaching cloud disks: The mounting and detaching of block devices must be flexibly performed on different nodes.
- Improved self-recovery capabilities for storage devices: The storage services must be automatically recoverable to reduce human intervention.
- Solutions must support the flexible configuration of volume sizes.
Most storage services already have monitoring capabilities at the underlying file system level. However, monitoring for cloud-native data volumes needs to be enhanced. Currently, PV monitoring is inadequate in terms of data dimensions and granularity. The following improvements are needed:
- More fine-grained monitoring capabilities at the directory level.
- More monitoring metrics, including read/write latency, read/write frequency, and I/O distribution.
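Directory-level usage and I/O distribution are straightforward to compute once the raw data is available. The sketch below, which uses only the Python standard library, sums the bytes used under each top-level directory of a volume and groups a list of I/O sizes into buckets. The directory names, bucket boundaries, and sample sizes are hypothetical.

```python
import os
import tempfile
from collections import Counter


def directory_usage(root: str) -> dict:
    """Bytes used under each top-level subdirectory of `root`."""
    usage = {}
    for entry in os.scandir(root):
        if not entry.is_dir():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                total += os.path.getsize(os.path.join(dirpath, name))
        usage[entry.name] = total
    return usage


def io_size_histogram(sizes_kib) -> dict:
    """Group I/O sizes (in KiB) into coarse buckets."""
    buckets = Counter()
    for size in sizes_kib:
        if size <= 4:
            buckets["<=4KiB"] += 1
        elif size <= 64:
            buckets["<=64KiB"] += 1
        else:
            buckets[">64KiB"] += 1
    return dict(buckets)


with tempfile.TemporaryDirectory() as volume_root:  # stands in for a mounted PV
    os.makedirs(os.path.join(volume_root, "tenant-a"))
    os.makedirs(os.path.join(volume_root, "tenant-b", "logs"))
    with open(os.path.join(volume_root, "tenant-a", "data.bin"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(volume_root, "tenant-b", "logs", "app.log"), "wb") as f:
        f.write(b"y" * 250)
    usage = directory_usage(volume_root)

histogram = io_size_histogram([4, 4, 16, 128, 1024])
print(usage, histogram)
```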
Big data computing scenarios usually involve a large number of applications accessing storage concurrently, which makes the performance of storage services a key bottleneck that determines the efficiency of application operations.
- The performance of underlying storage services needs to be improved, which can be achieved by optimizing high-performance storage services such as CPFS and GPFS to meet business needs.
- At the container orchestration layer, the storage scheduling capabilities need to be optimized, enabling adjacent storage access and distributed data storage to reduce the access pressure on individual volumes.
Shared Storage Isolation
Shared storage allows multiple pods to share data, which ensures unified data management and access by different applications. However, in multi-tenant scenarios, it is imperative to isolate the storage of different tenants.
The underlying storage provides strong isolation between directories, which makes it possible to isolate the tenants of a shared file system from one another at the file system level.
The container orchestration layer implements orchestration isolation based on namespaces and pod security policies (PSPs). This prevents tenants from accessing the volume services of other tenants during application deployment.
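Directory-based isolation on a shared file system can be sketched as a path check: each tenant is confined to its own subdirectory of the share, and any path that escapes that subdirectory is rejected. The function below is a minimal illustration with hypothetical share and tenant names. It shows the principle only and is not a substitute for enforcement by the storage service or the orchestration layer.

```python
import posixpath


def tenant_path(share_root: str, tenant: str, relative: str) -> str:
    """Resolve `relative` inside the tenant's directory, or refuse."""
    base = posixpath.join(share_root, tenant)
    candidate = posixpath.normpath(posixpath.join(base, relative))
    if candidate != base and not candidate.startswith(base + "/"):
        raise PermissionError(f"{relative!r} escapes the directory of {tenant}")
    return candidate


allowed = tenant_path("/share", "tenant-a", "logs/app.log")
print(allowed)

try:
    tenant_path("/share", "tenant-a", "../tenant-b/secret")
except PermissionError as exc:
    print("rejected:", exc)
```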