Pangu — The High Performance Distributed File System by Alibaba Cloud

Rapid Development of Underlying Hardware

Pressure from the Upper-Layer Business

Design Objectives

  1. Excellent Performance: Designs the architecture and optimizes the engineering for next-generation network and storage hardware and software, so that advances in both are passed on as real gains; delivers a distributed file system with high throughput and low latency.
  2. Fully Distributed Metadata Management: Manages metadata in a fully distributed way, with dynamic splitting and migration, to greatly increase the number of files that can be managed, remove the dependence of metadata nodes on special machine models, reduce the "blast radius" of faults, and improve platform stability.
  3. System Elasticity: Supports multiple product forms that share the same core paths, so that more businesses can be onboarded in the future without architecture changes; unifies the hardware access interfaces so that current and future hardware can be supported well.
  4. Optimized Cost: Adopts tiered storage, erasure coding (EC), compression, and deduplication to reduce storage cost, stay ahead in increasingly fierce business competition, and cope with exponential data growth.
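
To make the "dynamic splitting" in objective 2 more concrete, here is a minimal sketch of range-partitioned metadata that splits a partition once it grows past a threshold. It is an illustration only, not Pangu's implementation: the class names `Partition` and `MetaTable`, the `split_threshold` parameter, and the split-at-the-median rule are all assumptions, and migration of partitions between nodes is not modeled.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Partition:
    """A contiguous range of file paths (hypothetical structure)."""
    start: str                                  # inclusive lower bound of the range
    paths: list = field(default_factory=list)   # sorted paths in this range


class MetaTable:
    """Toy metadata table that splits a partition when it gets too large."""

    def __init__(self, split_threshold: int = 4):
        self.split_threshold = split_threshold
        # One partition covering the whole key space to begin with.
        self.partitions = [Partition(start="")]

    def locate(self, path: str) -> int:
        """Return the index of the partition whose range contains `path`."""
        starts = [p.start for p in self.partitions]
        return bisect_right(starts, path) - 1

    def add(self, path: str) -> None:
        idx = self.locate(path)
        part = self.partitions[idx]
        pos = bisect_right(part.paths, path)
        if pos > 0 and part.paths[pos - 1] == path:
            return  # already present
        part.paths.insert(pos, path)
        if len(part.paths) > self.split_threshold:
            self._split(idx)

    def _split(self, idx: int) -> None:
        """Split partition `idx` at its median path into two ranges."""
        part = self.partitions[idx]
        mid = len(part.paths) // 2
        right = Partition(start=part.paths[mid], paths=part.paths[mid:])
        part.paths = part.paths[:mid]
        self.partitions.insert(idx + 1, right)


table = MetaTable(split_threshold=4)
for i in range(10):
    table.add(f"/f{i}")
print(len(table.partitions), [len(p.paths) for p in table.partitions])
```

Because no partition is allowed to grow without bound, the load of any one range stays limited, which is the intuition behind the smaller "blast radius": a fault in one partition affects only the paths in that range.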

Pangu 2.0’s Architecture

Core Base

Fully Distributed Metadata Management

Efficient I/O Path

Excellent Thread Model

High-Performance Network Library

Cost Control

  1. Support for Large Multi-Medium Storage Pools: Pangu 2.0 supports large storage pools built from heterogeneous media. A single ChunkServer can use several media types, such as SSDs and HDDs, and different files, or different replicas of the same file, can be placed on specified media. This lets metadata and data, and the frontend and backend data of a business, each meet their own capacity, performance, and cost requirements. Sharing the resources of one large pool among multiple businesses also improves resource utilization.
  2. EC: HDFS 3.0 supports EC. Pangu 1.0 already supported backend EC, and Pangu 2.0 adds frontend EC. Compared with multi-replica storage, EC greatly reduces I/O and network traffic. In some scenarios, EC also reduces cost and increases throughput.
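
The saving from EC over multi-replica storage can be checked with simple arithmetic: a scheme with k data blocks and m parity blocks stores (k + m) / k bytes per byte of user data, while r replicas store r bytes. The sketch below works this out; the 3-replica baseline and the 8 + 3 layout are illustrative assumptions, not Pangu's documented configuration, and the function names are hypothetical.

```python
from fractions import Fraction


def replication_overhead(replicas: int) -> Fraction:
    """Bytes stored (and written over the network) per byte of user data."""
    return Fraction(replicas, 1)


def ec_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Bytes stored per byte of user data for a (data + parity) EC layout."""
    return Fraction(data_blocks + parity_blocks, data_blocks)


def bytes_written(payload_bytes: int, overhead: Fraction) -> Fraction:
    """Total bytes that land on disk for a given user payload."""
    return payload_bytes * overhead


# Assumed example values: 3 replicas versus an 8 + 3 EC layout.
rep = replication_overhead(3)
ec = ec_overhead(8, 3)
saving = 1 - ec / rep

print(f"replication: {float(rep):.3f}x, EC 8+3: {float(ec):.3f}x")
print(f"EC writes {float(saving):.1%} fewer bytes than 3 replicas")
print(f"1 GiB payload -> {int(bytes_written(2**30, ec))} bytes with EC 8+3")
```

Under these assumed values, EC stores 1.375 bytes per user byte instead of 3, which is where the reduction in I/O and network traffic on the write path comes from. Reads of degraded data need reconstruction, so the benefit depends on the workload.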


  1. Hadoop fs commands
  2. MapReduce without YARN + DFS
  3. MapReduce with YARN + DFS
  4. Hive without YARN + DFS
  5. Hive with YARN + DFS
  6. Spark without YARN + DFS
  7. Spark with YARN + DFS
  8. TPC-DS test SparkSQL + DFS
  9. TPC-DS test Impala + DFS

Future Prospects




Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:
