Seven Challenges for Cloud-native Storage in Practical Scenarios

By Eric Li (Zhuanghuai), Head of Alibaba Cloud-native Storage Business

Introduction

Portability, extensibility, and dynamic features are essential characteristics of high-performing cloud-native applications. However, such applications also impose requirements on the density, speed, and hybrid performance of cloud-native storage. Beyond the basic capabilities of cloud storage, these translate into requirements for efficiency, elasticity, autonomy, stability, low application coupling, GuestOS optimization, and security.

Containerizing new enterprise and intelligent workloads, migrating them to the cloud, and storing their data raises challenges in performance, elasticity, high availability, encryption, isolation, observability, lifecycle, and other aspects, and it is critical to address them. Upgrading storage products alone is not enough; improving the cloud-native control and data planes to drive the evolution of cloud-native storage is equally important.

The following sections describe cloud-native storage problems, the scenarios in which they arise, and feasible solutions. They then describe what cloud-native storage and cloud storage are capable of today and what to expect in the future.

1) Performance of Storage

1.1) Increase of High Latency

Scenario

A large volume of data is processed in batches in a high-performance computing scenario: thousands of pods start simultaneously in Kubernetes clusters, and the autoscaler scales out hundreds of ECS instances that read from and write to shared file systems.

Problem

Latency rises under heavy load, high-latency spikes become more frequent, and read/write performance becomes unstable.

Solution

  • Use container orchestration to distribute the I/O load across multiple file systems.
  • Use Apsara Distributed File System 2.0 to empower storage products.
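
One simple way to spread I/O across several file systems is to let each pod derive its target mount from a stable hash of its own name. The following is a minimal sketch of that idea only; the mount paths and the `pick_mount` helper are hypothetical and are not part of any Kubernetes or Alibaba Cloud API.

```python
import hashlib
from collections import Counter
from typing import Sequence

# Hypothetical mount points, one per shared file system.
MOUNTS = ("/mnt/nas-0", "/mnt/nas-1", "/mnt/nas-2", "/mnt/nas-3")


def pick_mount(pod_name: str, mounts: Sequence[str] = MOUNTS) -> str:
    """Map a pod name to one mount point using a stable hash.

    The same pod name always yields the same mount, so a restarted pod
    finds its data again, while different pods spread across all mounts.
    """
    digest = hashlib.sha256(pod_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(mounts)
    return mounts[index]


if __name__ == "__main__":
    # Show how 1,000 hypothetical pods spread over the mounts.
    counts = Counter(pick_mount(f"batch-job-{i}") for i in range(1000))
    for mount, count in sorted(counts.items()):
        print(mount, count)
```

A plain modulo hash reshuffles most pods when a file system is added or removed; a consistent-hashing scheme would reduce that movement at the cost of more code.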

1.2) Impact of Centralized High-throughput Writes on Shared Storage Pools

Scenario

In a high-performance computing scenario, read/write traffic of 10 Gbit/s is directed to the same storage cluster for centralized processing.

Problem

The bandwidth of the storage cluster is saturated, which degrades access to the cluster.

Solution

  • Use container orchestration to distribute the I/O load across multiple file systems as well as multiple storage clusters or zones.
  • Use the exclusive and high-performance Cloud Parallel File System (CPFS).
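
When the concern is bandwidth rather than request count, placement can take each cluster's remaining headroom into account. The sketch below is a greedy illustration under assumed inputs: the job names, cluster names, and bandwidth figures are hypothetical, and a real scheduler would also weigh locality and cost.

```python
from typing import Dict, Iterable, Tuple


def place_jobs(jobs: Iterable[Tuple[str, int]],
               capacity: Dict[str, int]) -> Dict[str, str]:
    """Assign each (job name, Gbit/s demand) to the cluster with most headroom.

    Jobs are placed largest first. Raises ValueError if a job does not
    fit into the cluster that currently has the most bandwidth left.
    """
    remaining = dict(capacity)
    placement: Dict[str, str] = {}
    for name, demand in sorted(jobs, key=lambda job: job[1], reverse=True):
        target = max(remaining, key=remaining.get)
        if remaining[target] < demand:
            raise ValueError(f"no cluster has {demand} Gbit/s left for {name}")
        remaining[target] -= demand
        placement[name] = target
    return placement


if __name__ == "__main__":
    jobs = [("align", 6), ("index", 5), ("merge", 3)]
    clusters = {"zone-a": 10, "zone-b": 10}
    print(place_jobs(jobs, clusters))
```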

1.3) Low Peak Throughput

Scenario

A large amount of biological data is processed. The number of files is small, but the peak throughput reaches 10 Gbit/s to 30 Gbit/s, with a request rate of 10,000 requests per second.

Problem

The occupied bandwidth almost reaches the bandwidth limit of the exclusive cluster.

Solution

  • Split reads and writes: send reads to Object Storage Service (OSS), and send writes to an exclusive file system and local or remote block storage. Use container orchestration to distribute the I/O load across multiple file systems.
  • Use an application-layer distributed cache to reduce network I/O reads.
  • Use Apsara Distributed File System 2.0 to empower storage products.
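
The read/write split can be pictured as a thin routing layer in the application. This is a sketch under stated assumptions: both backends are plain dictionaries standing in for an object store and a file system, and the `SplitStore` class is hypothetical.

```python
from typing import Dict


class SplitStore:
    """Send reads to one backend and writes to another.

    Both backends are plain dicts here; in practice they would wrap an
    object-store client and a file-system path (hypothetical stand-ins).
    """

    def __init__(self, read_backend: Dict[str, bytes],
                 write_backend: Dict[str, bytes]) -> None:
        self.read_backend = read_backend
        self.write_backend = write_backend

    def read(self, key: str) -> bytes:
        # Input data is only ever fetched from the read backend.
        return self.read_backend[key]

    def write(self, key: str, data: bytes) -> None:
        # Results never touch the read backend, so writes do not
        # compete with reads for its bandwidth.
        self.write_backend[key] = data


if __name__ == "__main__":
    oss = {"genome/sample-1": b"ACGT"}  # stand-in for OSS
    scratch: Dict[str, bytes] = {}      # stand-in for a file system
    store = SplitStore(oss, scratch)
    store.write("out/sample-1", store.read("genome/sample-1").lower())
    print(scratch)
```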

1.4) Higher Latency Leading to GPU Waits

Scenario

Multi-host multi-card GPU training is performed. This is a read-intensive scenario as data in OSS is read directly.

Problem

Higher latency leads to a high I/O wait value, which leaves the GPUs waiting for data.

Solution

  • Let applications read OSS data transparently through a POSIX interface.
  • Use an application-layer distributed cache to reduce network I/O reads.
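
The effect of a cache on network reads can be shown with a small read-through cache. Note that this sketch is a single-process LRU cache, not a distributed one; it only illustrates the principle that repeated reads of the same object stop reaching remote storage. The `fetch` callable is a hypothetical stand-in for a remote read.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve repeated reads from memory and fetch only on a miss."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch
        self._capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self.misses = 0  # number of reads that went to remote storage

    def get(self, key: str) -> bytes:
        if key in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: read from remote storage and remember the result.
        self.misses += 1
        data = self._fetch(key)
        self._items[key] = data
        if len(self._items) > self._capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return data


if __name__ == "__main__":
    cache = ReadThroughCache(lambda key: key.encode("utf-8"), capacity=2)
    for key in ["shard-0", "shard-0", "shard-1", "shard-0"]:
        cache.get(key)
    print("remote reads:", cache.misses)
```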

2) Elasticity of Storage

Scenario

  • Disk capacities of databases such as MySQL are resized online.
  • KV stores such as ZooKeeper and etcd are resized online.
  • The storage capacity of the local disk is fixed.

Solution

  • Implement storage resizing: resize block storage (cloud disks) online, and resize logical volumes or file systems offline or online.
  • Adjust the mounting density of standalone cloud disks.
  • Use enhanced SSDs as a replacement.
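
A controller that resizes volumes automatically needs a rule for when to grow and by how much. The following sketch shows one possible rule, assuming volumes can only grow; the 80% trigger, the 60% target utilization, and the 10 GiB step are illustrative values, not recommendations from the source.

```python
import math


def plan_resize(capacity_gib: int, used_gib: float,
                threshold: float = 0.8, target: float = 0.6,
                step_gib: int = 10) -> int:
    """Return the capacity in GiB that the volume should have.

    If utilization is below `threshold`, the current capacity is kept.
    Otherwise the volume grows so that utilization drops to `target`,
    rounded up to a multiple of `step_gib`. The result never shrinks.
    """
    if used_gib / capacity_gib < threshold:
        return capacity_gib
    needed = used_gib / target
    rounded = math.ceil(needed / step_gib) * step_gib
    return max(capacity_gib, rounded)


if __name__ == "__main__":
    print(plan_resize(100, 50))  # below the trigger: capacity unchanged
    print(plan_resize(100, 85))  # above the trigger: capacity grows
```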

3) High Availability of Storage

Scenario

  • Application and system O&M is performed.
  • The stability and discoverability of block storage should be ensured during its migration along with containers.

Solution

  • Implement declarative storage of snapshots and backups on the control plane, back up snapshots on a regular basis, and speed up backups and recoveries of local snapshots.
  • Improve the discoverability of cloud disks by SerialNum on the control plane and identify the device with a unique disk ID.
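
Device names such as vdb can change when a disk is detached and reattached, which is why identifying a disk by its serial number is more reliable. The sketch below assumes that each block device exposes a `serial` file under sysfs, as virtio block devices on Linux typically do; the `find_device_by_serial` helper and the example serial are hypothetical.

```python
import os
from typing import Optional


def find_device_by_serial(serial: str,
                          sys_block: str = "/sys/block") -> Optional[str]:
    """Return the device name (for example 'vdb') with the given serial.

    Scans <sys_block>/<device>/serial for every block device and returns
    None when no device reports the requested serial.
    """
    for device in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, device, "serial")
        try:
            with open(path, encoding="ascii", errors="replace") as handle:
                if handle.read().strip() == serial:
                    return device
        except OSError:
            # This device exposes no readable serial attribute.
            continue
    return None
```

A caller would pass the disk ID recorded at attach time, for example `find_device_by_serial("example-disk-id")`, and mount the returned device instead of relying on a fixed device name.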

4) Encryption of Storage

Scenario

  • End-to-end data of user applications is encrypted.
  • The operating system disk is encrypted.

Solution

  • Use cloud storage that supports data encryption with customer master keys (CMK) and Bring Your Own Key (BYOK).
  • Let the control plane declare the encryption method in the storage specification.
  • Apply the principle of least privilege to RAM permissions.

5) Isolation of Storage

Scenario

  • One disk is shared among multiple applications, and one block storage device is split with logical volume management (LVM) into multiple logical volumes for capacity isolation of logs.
  • The throughput of a single local disk or cloud disk is pushed to its limit.
  • File system capacity quotas are implemented in a multitenancy environment.
  • Control over shared access to cluster-level file systems is implemented.

Solution

  • Split disks on the control plane, and enable blkio QoS for buffered I/O at the application level.
  • Implement multi-disk aggregation and striping with LVM on the control plane.
  • Implement directory-level quotas for shared file systems of storage products.
  • Implement directory-level ACLs for file systems on the control plane.
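
Striping raises throughput because consecutive chunks of a logical volume land on different disks and can be served in parallel. The address arithmetic behind it can be sketched as follows; this is a generic round-robin stripe layout, and the stripe size and disk count in the example are arbitrary.

```python
from typing import Tuple


def locate(offset: int, stripe_size: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Stripes are assigned to disks round-robin: stripe 0 goes to disk 0,
    stripe 1 to disk 1, and so on, wrapping around after the last disk.
    """
    stripe_no, within = divmod(offset, stripe_size)
    row, disk = divmod(stripe_no, num_disks)
    return disk, row * stripe_size + within


if __name__ == "__main__":
    stripe = 64 * 1024  # 64 KiB stripes across 4 disks
    for logical in (0, stripe, 4 * stripe + 100):
        print(logical, "->", locate(logical, stripe, 4))
```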

6) Observability of Storage

Scenario

Tenant-level or application-level I/O metric monitoring and alerting are implemented in a multitenant ZooKeeper or etcd environment.

Solution

  • Use application-level I/O metrics collection on the control plane.
  • Use device-level I/O metrics collection on the control plane.
  • Use mount-point-level I/O metrics collection on the control plane.
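
At the device level, Linux already exposes cumulative I/O counters in /proc/diskstats, so a collector only needs to sample them twice and compute the difference. The following sketch covers that device-level case only; the `Sample` type and both helper functions are hypothetical, while the column positions follow the documented /proc/diskstats layout.

```python
from typing import Dict, NamedTuple

SECTOR_BYTES = 512  # /proc/diskstats counts sectors of 512 bytes


class Sample(NamedTuple):
    reads: int
    sectors_read: int
    writes: int
    sectors_written: int


def parse_diskstats(text: str) -> Dict[str, Sample]:
    """Parse the contents of /proc/diskstats into per-device counters."""
    samples: Dict[str, Sample] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        # fields: major, minor, name, reads, reads merged, sectors read,
        # ms reading, writes, writes merged, sectors written, ...
        samples[fields[2]] = Sample(int(fields[3]), int(fields[5]),
                                    int(fields[7]), int(fields[9]))
    return samples


def rates(before: Sample, after: Sample, seconds: float) -> Dict[str, float]:
    """Turn two cumulative samples into IOPS and bytes per second."""
    return {
        "read_iops": (after.reads - before.reads) / seconds,
        "write_iops": (after.writes - before.writes) / seconds,
        "read_bytes_per_s":
            (after.sectors_read - before.sectors_read) * SECTOR_BYTES / seconds,
        "write_bytes_per_s":
            (after.sectors_written - before.sectors_written) * SECTOR_BYTES / seconds,
    }
```

Application-level and mount-point-level metrics need other sources, such as cgroup I/O accounting or file-system client statistics, which this sketch does not cover.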

7) Lifecycle of Storage

Scenario

Shared file systems or cache systems are created and deleted in a declarative manner.

Solution

  • Operator for cloud disks or local disks (TiDB)
  • Operator for file systems and CPFS
  • Operator for OSS
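
An operator manages lifecycle declaratively by repeatedly comparing the desired state with the actual state and acting on the difference. The core of such a reconcile step can be sketched as below; the resource names and spec fields are hypothetical, and a real operator would call storage APIs instead of returning a list of actions.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, dict],
              actual: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return the (action, name) pairs needed to make actual match desired."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            # Declared resources were removed, so clean up what is left.
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    desired = {"cache-a": {"size_gib": 100}, "fs-b": {"size_gib": 50}}
    actual = {"fs-b": {"size_gib": 40}, "fs-old": {"size_gib": 20}}
    for action in reconcile(desired, actual):
        print(action)
```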

Cloud-native Storage V2

To meet the preceding challenges in storage performance, elasticity, high availability, encryption, isolation, observability, lifecycle, and other aspects of a new computing mode, upgrading storage products is not enough. The cloud-native control and data planes must also improve in order to achieve a stable, secure, autonomous, and efficient cloud-native storage v2 in the near future.

  • Stability: All Alibaba Cloud storage products support observability, Flexvolume and CSI plug-ins, and I/O metrics (CSI for 1.14).
  • Security: Alibaba Cloud storage products enable reliable and trusted storage of data throughout the process, as well as CSI snapshot encryption and system disk encryption.
  • Autonomy: Alibaba Cloud storage products support cloud disk snapshots and local snapshots, offline and online storage expansion, and automatic discovery of metadata.
  • Efficiency: I/O isolation for NAS and EBS; throughput scalability for NAS, EBS, and OSS; attached disk density for EBS; and cluster-level aggregate throughput for CSI Cache.

With effective improvements and enhancements on the cloud-native application layer, the storage cloud product layer, the underlying storage adaptation, and the storage core layer, it’s possible to provide more stable, secure, autonomous, and efficient application-oriented cloud-native storage.

Summary

Let’s take a look at a quick snapshot of the article:

  • Cloud-native storage is a collection of capabilities, such as cloud storage UIs and efficiency.
  • Tiered storage does not require reinventing the wheel.
  • New workloads drive the evolution of cloud-native storage and cloud storage. The cloud-native control plane ensures high levels of efficiency and autonomy, while the data plane enhances storage stability and reduces security risks. Cloud storage continues to reinforce its basic capabilities, including performance, capacity, elasticity, and density, to co-build a storage ecosystem in a cloud-native environment.

The evolution of cloud-native storage v2 still requires a joint effort from the container team and the storage team, so that the storage capability in the cloud-native era is improved.
