Seven Challenges for Cloud-native Storage in Practical Scenarios

By Eric Li (Zhuanghuai), Head of Alibaba Cloud-native Storage Business

Introduction

Portability, extensibility, and dynamic behavior are essential to high-performing cloud-native applications. At the same time, cloud-native applications impose requirements on the density, speed, and hybrid performance of cloud-native storage. These translate into requirements for efficiency, elasticity, autonomy, stability, low application coupling, GuestOS optimization, and security, on top of the basic capabilities of cloud storage.

It is critical to address challenges in performance, elasticity, high availability, encryption, isolation, observability, lifecycle management, and other aspects that arise when new enterprise and intelligent workloads are containerized, migrated to the cloud, and stored there. Upgrading storage products alone is not enough; the cloud-native control and data planes must also be improved to drive the evolution of cloud-native storage.

The following sections describe common cloud-native storage problems, the scenarios in which they occur, and feasible solutions. They also outline what cloud-native storage and cloud storage can do today and what to expect in the future.

1) Performance of Storage

Scenario

In a high-performance computing scenario, massive amounts of data are processed in batches: thousands of pods start simultaneously in Kubernetes clusters, and the autoscaler scales out hundreds of ECS instances that read data from and write data to shared file systems.

Problem

Latency increases under heavy load, latency spikes become more frequent, and read/write performance is unstable.

Solution

  • Use container orchestration to distribute the I/O load across multiple file systems.
  • Use Apsara Distributed File System 2.0 to empower storage products.
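As one possible sketch of the orchestration approach above, a batch workload can be sharded across two file systems by binding each shard to its own PersistentVolumeClaim, so pods in each shard generate I/O against a different backend. All names and StorageClasses below are hypothetical:

```yaml
# Hypothetical sketch: spread I/O across two file systems, one PVC per shard.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-shard-a            # pods in shard A mount only this claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: alicloud-nas-a   # assumed class backed by file system A
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-shard-b            # pods in shard B mount only this claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: alicloud-nas-b   # assumed class backed by file system B
  resources:
    requests:
      storage: 500Gi
```

Because each shard of pods mounts only its own claim, the aggregate read/write load is split across the file systems instead of concentrating on one.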

Scenario

In a high-performance computing scenario, 10 Gbit/s of read/write traffic is directed to the same storage cluster for centralized processing.

Problem

The bandwidth of the storage cluster is saturated, which degrades the cluster access experience.

Solution

  • Use container orchestration to distribute the I/O load across multiple file systems, and spread those file systems across multiple storage clusters or zones.
  • Use the exclusive and high-performance Cloud Parallel File System (CPFS).

Scenario

A large amount of biological data is processed across a small number of files, yet peak throughput reaches 10 Gbit/s to 30 Gbit/s with a request density of 10,000 requests per second.

Problem

The occupied bandwidth approaches the bandwidth limit of the exclusive cluster.

Solution

  • Split reads and writes: distribute reads to Object Storage Service (OSS), and writes to an exclusive file system and local or remote block storage. Use container orchestration to distribute the I/O load across multiple file systems.
  • Use an application-layer distributed cache to reduce network I/O reads.
  • Use Apsara Distributed File System 2.0 to empower storage products.

Scenario

Multi-node, multi-GPU training is performed. This is a read-intensive scenario because data is read directly from OSS.

Problem

High latency drives up the iowait value and leaves GPUs idle waiting for data.

Solution

  • Let applications read OSS transparently through a POSIX interface.
  • Use an application-layer distributed cache to reduce network I/O reads.
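One way to realize the distributed-cache idea above is the open-source Fluid project, which caches OSS data close to GPU nodes and exposes it as a PVC. Using Fluid here is an assumption, and the bucket path and sizing are illustrative:

```yaml
# Sketch: cache an OSS bucket in cluster memory with Fluid + Alluxio.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: training-data
spec:
  mounts:
    - mountPoint: oss://my-bucket/imagenet   # hypothetical OSS bucket
      name: imagenet
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: training-data          # must match the Dataset name
spec:
  replicas: 3                  # cache workers spread across GPU nodes
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 20Gi            # per-worker memory cache size
```

Training pods then mount the generated `training-data` PVC and read through the cache, which cuts repeated network reads against OSS and reduces GPU idle time.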

2) Elasticity of Storage

  • Disk capacities of databases such as MySQL are resized online.
  • KV stores such as ZooKeeper and etcd are resized online.
  • The storage capacity of the local disk is fixed.
  • Implement storage resizing: resize block storage (cloud disks) online, and resize logical volumes or file systems online or offline.
  • Adjust the mounting density of standalone cloud disks.
  • Use enhanced SSDs (ESSDs) as a replacement.
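Online resizing of cloud disks can be enabled declaratively. The sketch below assumes the Alibaba Cloud disk CSI driver; the class name is hypothetical:

```yaml
# Sketch: a StorageClass that permits online volume expansion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-resizable    # hypothetical name
provisioner: diskplugin.csi.alibabacloud.com
allowVolumeExpansion: true         # required before a PVC can be grown
parameters:
  type: cloud_essd                 # enhanced SSD
```

With this in place, increasing `spec.resources.requests.storage` on a bound PVC triggers the CSI driver to expand the cloud disk and the file system, typically without detaching the volume from the pod.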

3) High Availability of Storage

  • Application and system O&M is performed.
  • The stability and discoverability of block storage must be ensured when it migrates along with containers.
  • Implement declarative snapshots and backups on the control plane, back up snapshots on a regular basis, and speed up backup and recovery with local snapshots.
  • Improve the discoverability of cloud disks on the control plane by serial number, and identify each device with a unique disk ID.
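Declarative snapshots on the control plane look like the following, using the standard Kubernetes snapshot API; the snapshot class and PVC names are assumptions:

```yaml
# Sketch: declaratively snapshot a disk-backed PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: mysql-data-snapshot
spec:
  volumeSnapshotClassName: alicloud-disk-snapshot  # assumed snapshot class
  source:
    persistentVolumeClaimName: mysql-data          # the PVC to back up
```

A new PVC can later reference this snapshot as its `dataSource` to restore, and a cron-driven controller can create such objects on a schedule for regular backups.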

4) Encryption of Storage

  • End-to-end data of user applications is encrypted.
  • The operating system disk is encrypted.
  • Cloud storage supports data encryption with customer master keys (CMK) and bring-your-own-key (BYOK).
  • The control plane supports declaring the encryption method in the storage claim.
  • Implement permission minimization for RAM roles.
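Declaring the encryption method on the control plane can be done through StorageClass parameters. The sketch below assumes the Alibaba Cloud disk CSI driver; the class name and key ID are hypothetical:

```yaml
# Sketch: provision encrypted cloud disks with a customer-managed key.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-encrypted
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
  encrypted: "true"              # encrypt every disk from this class
  kmsKeyId: "my-cmk-id"          # hypothetical CMK/BYOK key ID in KMS
```

Every PVC provisioned from this class is encrypted with the specified key, so applications get encryption at rest without any code changes.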

5) Isolation of Storage

  • One disk is shared among multiple applications, and one block storage device is split with Logical Volume Manager (LVM) into multiple logical volumes for capacity isolation of logs.
  • The throughput of a single local disk or cloud disk is stretched across multiple applications.
  • File system capacity quotas are implemented in a multitenancy environment.
  • Control over shared access to cluster-level file systems is implemented.
  • Split disks on the control plane, and enable blkio cgroup QoS for buffered I/O at the application level.
  • Implement multi-disk aggregation and striping with LVM on the control plane.
  • Implement directory-level quotas for shared file systems of storage products.
  • Implement directory-level ACLs for file systems of the control plane.
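The LVM-based splitting described above can be driven from the control plane with a local-volume CSI driver. This is a hypothetical sketch: the provisioner, parameter names, and volume group are assumptions about such a driver's interface:

```yaml
# Hypothetical sketch: carve capacity-isolated logical volumes
# out of one shared disk via a local LVM CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-lvm-logs
provisioner: localplugin.csi.alibabacloud.com   # assumed local CSI driver
parameters:
  volumeType: LVM
  vgName: volumegroup-logs    # VG created on the shared physical disk
  fsType: ext4
```

Each application claims its own PVC from this class and receives a separate logical volume, so one application's runaway log growth cannot consume another's capacity.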

6) Observability of Storage

Tenant- or application-level I/O metric monitoring and alerting are implemented in a multitenant ZooKeeper or etcd environment.

  • Use application-level I/O metrics collection on the control plane.
  • Use device-level I/O metrics collection on the control plane.
  • Use mount-point-level I/O metrics collection on the control plane.
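Mount-point-level metrics are already exported by the kubelet per PVC, so alerting can be expressed declaratively. The sketch below assumes the Prometheus Operator's PrometheusRule CRD is installed; thresholds and names are illustrative:

```yaml
# Sketch: alert when any PVC is more than 90% full,
# using the kubelet's per-volume metrics.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-usage-alerts
spec:
  groups:
    - name: storage
      rules:
        - alert: PVCAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes
              / kubelet_volume_stats_capacity_bytes > 0.9
          for: 10m
          labels:
            severity: warning
```

Because the metrics carry the namespace and PVC name as labels, the same rule yields per-tenant and per-application alerts in a multitenant cluster.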

7) Lifecycle of Storage

Shared file systems and cache systems are created and deleted in a declarative manner.

  • An operator for cloud disks or local disks (for example, for TiDB)
  • An operator for file systems such as CPFS
  • An operator for OSS
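Such an operator exposes the storage lifecycle as a custom resource. The instance below is purely illustrative: the API group, kind, and fields describe a hypothetical file-system operator, not an existing product API:

```yaml
# Purely illustrative CR for a hypothetical file-system operator.
apiVersion: storage.example.com/v1alpha1
kind: FileSystem
metadata:
  name: training-cpfs
spec:
  type: CPFS                 # which backend the operator provisions
  capacity: 10Ti
  deletionPolicy: Retain     # keep the data when the CR is deleted
```

Creating the CR provisions the file system; deleting it tears the file system down (or retains it, per the policy), so the full lifecycle lives in version-controlled manifests.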

Cloud-native Storage V2

In response to the preceding challenges in storage performance, elasticity, high availability, encryption, isolation, observability, lifecycle management, and other aspects of new computing modes, it is imperative not only to upgrade storage products, but also to improve the cloud-native control and data planes, so that stable, secure, autonomous, and efficient cloud-native storage v2 can be achieved in the near future.

  • Stability: All Alibaba Cloud storage products support observability, Flexvolume and CSI plug-ins, and I/O metrics (CSI for Kubernetes 1.14+).
  • Security: Alibaba Cloud storage products enable reliable and trusted storage of data throughout the process, including CSI snapshot encryption and system disk encryption.
  • Autonomy: Alibaba Cloud storage products support cloud disk snapshots and local snapshots, offline and online storage expansion, and automatic discovery of metadata.
  • Efficiency: I/O isolation for NAS and EBS; throughput scalability for NAS, EBS, and OSS; higher attached-disk density for EBS; cluster-level aggregate throughput with CSI caching.

With effective improvements and enhancements on the cloud-native application layer, the storage cloud product layer, the underlying storage adaptation, and the storage core layer, it’s possible to provide more stable, secure, autonomous, and efficient application-oriented cloud-native storage.

Summary

Let’s take a look at a quick snapshot of the article:

  • Cloud-native storage is a collection of capabilities, such as cloud storage interfaces and efficiency improvements.
  • Tiered storage does not require reinventing the wheel.
  • New workloads promote the evolution of cloud-native storage and cloud storage. The cloud-native control plane ensures high levels of efficiency and autonomy, enhances storage stability, and reduces security risks on the data plane. Cloud storage continues to reinforce its basic capabilities, including performance, capacity, elasticity, and density, to co-build a storage ecosystem for the cloud-native era.

The evolution of cloud-native storage v2 still requires a joint effort between the container and storage teams to improve storage capabilities in the cloud-native era.
