Supporting Pre-Peak Scale-Up and Post-Peak Scale-Down with DBFS

Alibaba Cloud’s Approach in 2017

Back in 2017, Dr. Wang Jian initiated a lively discussion about whether “IDC as a computer” was possible. Achieving this objective requires that storage and computing resources be separated so that the scheduler can schedule each of them independently and freely. Of all businesses, databases are the hardest to move to storage and computing separation, because they impose extremely demanding requirements on I/O latency and stability. Even so, storage and computing separation is becoming a technical trend in the industry, and it has been implemented in Google Spanner and Aurora.

Technical Breakthroughs in 2018

Building on the breakthroughs made in storage and computing separation in 2017, we pursued extreme performance in 2018 and moved from experiments to large-scale deployment. These goals presented developers with even greater challenges: storage and computing separation had to become more efficient, adaptive, universal, and simple.

User-State Technology

“Zero” Replication

Page Caching

To achieve buffered I/O capabilities, page caching was implemented separately, using a touch-count-based LRU algorithm. The touch count was introduced to better fit the I/O features of databases. Large table scans are common in databases, but they pull in data pages that are rarely used again, which can compromise LRU efficiency. To address this issue, pages are moved between the hot and cold ends based on the touch count.

  1. The proportion of the hot and cold ends is configurable; it is currently 2:8.
  2. The page size is configurable, so page caching can be optimized by matching it to the page size of the database.
  3. Multiple shards are available for increasing concurrency, and the overall capacity is configurable.
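The hot/cold behavior described above can be sketched as follows. This is a minimal, single-shard illustration and not the DBFS implementation: the names (`tc_cache`, `tc_access`), the tiny capacity, and the promotion threshold of two touches are all assumptions made for the example. New pages enter at the head of the cold end, and a page moves to the hot end only after it has been touched again, so a one-pass table scan cannot push hot pages out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TC_CAPACITY 10       /* total cached pages (hypothetical, tiny for clarity) */
#define TC_HOT_CAPACITY 2    /* 2:8 hot:cold split, as quoted in the text */
#define TC_PROMOTE_TOUCHES 2 /* hypothetical promotion threshold */

typedef struct {
    long page_no;
    unsigned touches;
} tc_entry;

typedef struct {
    tc_entry slot[TC_CAPACITY]; /* slot[0] is hottest, slot[len-1] is coldest */
    size_t len;
} tc_cache;

static void tc_init(tc_cache *c) { c->len = 0; }

/* Return the position of page_no in the list, or -1 if it is not cached. */
static long tc_find(const tc_cache *c, long page_no) {
    for (size_t i = 0; i < c->len; i++)
        if (c->slot[i].page_no == page_no) return (long)i;
    return -1;
}

/* Move the entry at index `from` up to index `to` (to <= from),
 * shifting the entries in between one position toward the cold end. */
static void tc_move_up(tc_cache *c, size_t from, size_t to) {
    assert(to <= from && from < c->len);
    tc_entry e = c->slot[from];
    memmove(&c->slot[to + 1], &c->slot[to], (from - to) * sizeof(tc_entry));
    c->slot[to] = e;
}

/* Access a page. Returns the evicted page number, or -1 if nothing was evicted. */
static long tc_access(tc_cache *c, long page_no) {
    long idx = tc_find(c, page_no);
    if (idx >= 0) {
        tc_entry *e = &c->slot[idx];
        e->touches++;
        /* A cold page that has been touched often enough moves to the hot head. */
        if ((size_t)idx >= TC_HOT_CAPACITY && e->touches >= TC_PROMOTE_TOUCHES)
            tc_move_up(c, (size_t)idx, 0);
        return -1;
    }
    long evicted = -1;
    if (c->len == TC_CAPACITY) { /* evict from the cold tail */
        evicted = c->slot[c->len - 1].page_no;
        c->len--;
    }
    /* New pages are placed at the head of the cold end, not the hot head. */
    size_t at = c->len < TC_HOT_CAPACITY ? c->len : TC_HOT_CAPACITY;
    c->slot[c->len].page_no = page_no;
    c->slot[c->len].touches = 1;
    c->len++;
    tc_move_up(c, c->len - 1, at);
    return evicted;
}
```

The array with `memmove` keeps the sketch short; a real cache would more likely use a linked list plus a hash index, and would shard the structure as the third list item describes.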

Asynchronous I/O

Most database products use asynchronous I/O to improve I/O throughput. Similarly, DBFS implements asynchronous I/O to adapt to the I/O features of upper-layer databases. Asynchronous I/O has the following features:

  1. Allows configuration of the I/O depth, which ensures precise latency control for different database I/O types.
  2. Provides adaptive polling to reduce CPU consumption.
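The two features above can be illustrated with a small bookkeeping sketch. It performs no real I/O, and everything in it is an assumption for illustration: the text does not say how DBFS enforces the I/O depth or how its polling adapts. Here, a gate caps the number of in-flight requests at a configured depth, and a poller busy-polls while completions arrive but backs off with a doubling sleep interval after several empty polls.

```c
#include <assert.h>

/* Caps the number of requests in flight at a configured depth. */
typedef struct {
    unsigned depth;    /* configured I/O depth */
    unsigned inflight; /* requests submitted but not yet completed */
} aio_gate;

/* Returns 1 if the request may be submitted, 0 if the depth is exhausted. */
static int aio_try_submit(aio_gate *g) {
    if (g->inflight >= g->depth) return 0;
    g->inflight++;
    return 1;
}

static void aio_complete(aio_gate *g) {
    assert(g->inflight > 0);
    g->inflight--;
}

/* Hypothetical adaptive poller: sleep_us == 0 means busy-polling. */
typedef struct {
    unsigned sleep_us;     /* current sleep between polls */
    unsigned max_sleep_us; /* upper bound for the backoff */
    unsigned idle_polls;   /* consecutive polls that found nothing */
    unsigned idle_limit;   /* empty polls tolerated before backing off */
} adaptive_poller;

/* Feed in the number of completions found by the last poll;
 * returns how long to sleep before polling again. */
static unsigned poller_update(adaptive_poller *p, unsigned completions) {
    if (completions > 0) {
        p->idle_polls = 0;
        p->sleep_us = 0; /* traffic is flowing: go back to busy-polling */
    } else if (++p->idle_polls >= p->idle_limit) {
        unsigned next = p->sleep_us ? p->sleep_us * 2 : 1;
        p->sleep_us = next > p->max_sleep_us ? p->max_sleep_us : next;
    }
    return p->sleep_us;
}
```

A separate gate per I/O type (for example, one for log writes and one for data pages) is one way a configurable depth could translate into latency control, though the text does not confirm that design.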

Atomic Writing

DBFS implements atomic writing to ensure that partial writes do not occur when a database page is written. InnoDB running on DBFS can therefore safely disable the doublewrite buffer, which conserves bandwidth for the entire database under storage and computing separation.

Online Resizing

To prevent data migration due to resizing, DBFS is combined with the underlying Apsara Distributed File System for online volume resizing. DBFS uses its own bitmap allocator to manage underlying storage space. The bitmap allocator is optimized to achieve lock-free resizing at the file system layer, making it possible for the upper-layer business to efficiently resize at any time without loss. This makes DBFS superior to the traditional ext4 file system.
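To show why a bitmap allocator lends itself to resizing without data migration, here is a minimal sketch. It is single-threaded and does not attempt the lock-free optimization mentioned above; the names (`bitmap_alloc`, `bm_grow`) are hypothetical. Growing the volume only extends the bitmap with free bits, so blocks that are already allocated keep their numbers and nothing has to move.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t *bits;  /* one bit per block; 1 = allocated */
    size_t nblocks; /* number of blocks currently managed */
} bitmap_alloc;

static int bm_init(bitmap_alloc *a, size_t nblocks) {
    assert(nblocks > 0);
    a->bits = calloc((nblocks + 7) / 8, 1);
    if (!a->bits) return -1;
    a->nblocks = nblocks;
    return 0;
}

static void bm_destroy(bitmap_alloc *a) {
    free(a->bits);
    a->bits = NULL;
    a->nblocks = 0;
}

/* Allocate the lowest free block; returns its number, or -1 if full. */
static long bm_alloc(bitmap_alloc *a) {
    for (size_t i = 0; i < a->nblocks; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (!(a->bits[i / 8] & mask)) {
            a->bits[i / 8] |= mask;
            return (long)i;
        }
    }
    return -1;
}

static void bm_free(bitmap_alloc *a, size_t i) {
    assert(i < a->nblocks);
    a->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* Extend the managed space. Existing allocations are untouched. */
static int bm_grow(bitmap_alloc *a, size_t new_nblocks) {
    assert(new_nblocks >= a->nblocks);
    size_t old_bytes = (a->nblocks + 7) / 8;
    size_t new_bytes = (new_nblocks + 7) / 8;
    uint8_t *p = realloc(a->bits, new_bytes);
    if (!p) return -1;
    memset(p + old_bytes, 0, new_bytes - old_bytes); /* new blocks start free */
    a->bits = p;
    a->nblocks = new_nblocks;
    return 0;
}
```

In this sketch the caller would first grow the underlying volume and then call `bm_grow` with the new block count; how DBFS coordinates that step with the Apsara Distributed File System is not described in the text.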

Switching between TCP and RDMA

The extensive use of RDMA poses a risk to the Group’s databases. By using DBFS along with the Apsara Distributed File System, the databases can switch between TCP and RDMA, and switching drills can be run throughout the link. In this way, the RDMA risks can be controlled to ensure stability.
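One common way to make a transport switchable at run time is to route every send through a table of function pointers. The sketch below assumes that pattern purely for illustration; the text does not describe how DBFS performs the switch. The send functions are stubs that only count calls, and all names (`io_link`, `link_switch`) as well as the choice of TCP as the initial transport are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TRANSPORT_TCP, TRANSPORT_RDMA } transport_kind;

typedef struct {
    transport_kind kind;
    long (*send)(const void *buf, size_t len); /* returns bytes sent */
} transport_ops;

/* Counters so that a drill can check which path carried the traffic. */
static unsigned long tcp_sends, rdma_sends;

/* Stubs: a real implementation would write to a socket or post to a queue pair. */
static long tcp_send_stub(const void *buf, size_t len) {
    (void)buf;
    tcp_sends++;
    return (long)len;
}

static long rdma_send_stub(const void *buf, size_t len) {
    (void)buf;
    rdma_sends++;
    return (long)len;
}

static const transport_ops TCP_OPS = { TRANSPORT_TCP, tcp_send_stub };
static const transport_ops RDMA_OPS = { TRANSPORT_RDMA, rdma_send_stub };

typedef struct {
    const transport_ops *active; /* transport currently in use */
} io_link;

static void link_init(io_link *l) { l->active = &TCP_OPS; }

/* Switch the link to the requested transport. */
static void link_switch(io_link *l, transport_kind k) {
    l->active = (k == TRANSPORT_RDMA) ? &RDMA_OPS : &TCP_OPS;
}

/* Callers never name the transport; they go through the active table. */
static long link_send(io_link *l, const void *buf, size_t len) {
    assert(l->active != NULL);
    return l->active->send(buf, len);
}
```

Because callers only use `link_send`, a drill can flip the transport and flip it back without touching the database code above it. A multithreaded version would need to update `active` atomically.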

Deployment for the Big Sales Campaign in 2018

After these technical breakthroughs and the accompanying troubleshooting, DBFS took on the daunting task of withstanding the full-link traffic of the big sales campaign during the Double Eleven. Its success verified the feasibility of the overall technical trend towards storage and computing separation.

Alibaba DBFS: A Revolutionary Storage Tool

In addition to the preceding features, as a file system, DBFS also provides many other features to ensure its universality, ease of use, stability, and security for businesses.

Technical Accumulation and Enablement

Introducing all our technical innovations and capabilities into DBFS as products enables more businesses in user state to access different underlying storage media and databases to implement storage and computing separation.

// glibc interface
FILE *fopen(const char *path, const char *mode);
FILE *fdopen(int fildes, const char *mode);
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);
int fflush(FILE *stream);
int fclose(FILE *stream);
int fileno(FILE *stream);
int feof(FILE *stream);
int ferror(FILE *stream);
void clearerr(FILE *stream);
int fseeko(FILE *stream, off_t offset, int whence);
int fseek(FILE *stream, long offset, int whence);
off_t ftello(FILE *stream);
long ftell(FILE *stream);
void rewind(FILE *stream);
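As a short illustration of how an application drives the stream calls listed above, the sketch below writes one page-sized buffer, flushes it, seeks back, and reads it again. It uses the standard stdio functions against an ordinary file; the function name `page_roundtrip`, the 4 KB size, and the path passed by the caller are assumptions for the example, and the text does not say how DBFS is linked in behind these names.

```c
#define _POSIX_C_SOURCE 200809L /* for fseeko/ftello under strict C modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define DEMO_PAGE_SIZE 4096 /* hypothetical page size */

/* Write one page, read it back, and compare.
 * Returns 0 on success and -1 on any failure. */
static int page_roundtrip(const char *path) {
    assert(path != NULL);
    unsigned char out[DEMO_PAGE_SIZE], in[DEMO_PAGE_SIZE];
    memset(out, 0xAB, sizeof out);

    FILE *fp = fopen(path, "w+b");
    if (!fp) return -1;

    int rc = -1;
    if (fwrite(out, 1, sizeof out, fp) != sizeof out) goto done;
    if (fflush(fp) != 0) goto done;                  /* push buffered data down */
    if (ftello(fp) != (off_t)sizeof out) goto done;  /* position is one page in */
    if (fseeko(fp, 0, SEEK_SET) != 0) goto done;     /* rewind to the page start */
    if (fread(in, 1, sizeof in, fp) != sizeof in) goto done;
    if (ferror(fp)) goto done;
    if (memcmp(in, out, sizeof out) == 0) rc = 0;
done:
    if (fclose(fp) != 0) rc = -1;
    return rc;
}
```

Calling `page_roundtrip` with a writable path returns 0 when the page read back matches the page written.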
  1. If failover is required, the control platform sends a failover command. Once the failover command has run, both DBFS and the upper-layer databases have completed role switching.
  2. During DBFS failover, the key “I/O fence” action disables the I/O capability of the M node to prevent dual-writes from occurring.
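The ordering that matters in the two steps above is that the old writer loses its I/O capability before the new one gains it. The sketch below models only that ordering and is not the DBFS mechanism: it assumes "M node" refers to the current master, and the types and function names (`node`, `failover`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ROLE_MASTER, ROLE_STANDBY } node_role;

typedef struct {
    node_role role;
    bool io_enabled; /* cleared by the I/O fence */
} node;

/* A write succeeds only on an unfenced master. Returns 0 on success, -1 otherwise. */
static int node_write(const node *n) {
    return (n->role == ROLE_MASTER && n->io_enabled) ? 0 : -1;
}

/* Role switch: fence the old master first, then promote the standby,
 * so that there is no moment at which both nodes can write. */
static void failover(node *old_master, node *standby) {
    assert(old_master != standby);
    old_master->io_enabled = false; /* the I/O fence */
    old_master->role = ROLE_STANDBY;
    standby->role = ROLE_MASTER;
    standby->io_enabled = true;
}
```

In a real distributed setting the fence has to be enforced on the storage side, because a node that is partitioned away cannot be trusted to disable its own I/O.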

Integration of Hardware and Software

With the emergence of new storage media, databases need both to improve performance or lower costs and to control the underlying storage media.

  1. Dynamic enablement and disablement
  2. Load balancing
  3. Performance metrics collection and presentation
  4. Data correctness scrubbing

Summary and Prospects

In 2018, DBFS widely supported X-DB and the Double Eleven campaign in storage and computing separation mode. Meanwhile, it enabled ADS to implement the single-write-and-multi-read function and the Tair solution.



Alibaba Cloud
