Pangu — The High-Performance Distributed File System by Alibaba Cloud

Rapid Development of Underlying Hardware

Pressure from the Upper-Layer Business

Design Objectives

  1. Excellent Performance: Tailors the architecture design and engineering optimizations to next-generation network and storage hardware and software, so that advances in the underlying technology are fully converted into system gains, and delivers an ultra-high-performance distributed file system with high throughput and low latency.
  2. Fully Distributed Metadata Management: Manages metadata in a fully distributed way, with dynamic splitting and migration, to greatly increase the number of files the system can manage, remove the dependence of metadata nodes on special machine models, shrink the blast radius of faults, and improve platform stability (a simplified sketch follows this list).
  3. System Elasticity: Supports multiple product forms on shared core paths, so that new businesses can be onboarded in the future without rearchitecting the system; unifies the hardware access interfaces so that both current hardware and future devices can be supported optimally.
  4. Optimized Cost: Uses storage tiering, erasure coding (EC), compression, and deduplication to reduce storage cost, keeping the initiative in increasingly fierce business competition and building the technical advantages needed to cope with exponential data growth.
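To make objective 2 more concrete, here is a minimal sketch of one common way to distribute file-system metadata across many servers: route each file to a metadata shard by hashing its parent directory, so entries in the same directory stay together and a hot shard can later be split with only the affected directories migrated. The class name, shard count, and hashing scheme below are illustrative assumptions for this article, not Pangu's actual design.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/**
 * Illustrative only: maps a file path to one of N metadata shards by hashing
 * the parent directory, so entries in one directory land on one shard.
 * This is a generic sketch, not Pangu's actual partitioning scheme.
 */
public class MetaShardRouter {
    private final int shardCount;   // hypothetical number of metadata servers

    public MetaShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Route a path such as "/logs/2022/app.log" to a shard id. */
    public int shardFor(String path) {
        int slash = path.lastIndexOf('/');
        String parentDir = (slash <= 0) ? "/" : path.substring(0, slash);
        CRC32 crc = new CRC32();
        crc.update(parentDir.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount);
    }

    public static void main(String[] args) {
        MetaShardRouter router = new MetaShardRouter(16);
        System.out.println(router.shardFor("/logs/2022/app.log"));   // shard for /logs/2022
        System.out.println(router.shardFor("/logs/2022/other.log")); // same shard as above
    }
}
```

In a real system, splitting a shard means re-mapping a subset of directories to a new server and migrating only that slice of metadata, which is what keeps the blast radius of a single metadata node small.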

Pangu 2.0’s Architecture

Core Base

Fully Distributed Metadata Management

Efficient I/O Path

Excellent Thread Model

High-Performance Network Library

Cost Control

  1. Support for the Multi-Medium Large Storage Pool: Pangu 2.0 supports large storage pools built from heterogeneous media. A ChunkServer can mix media such as SSDs and HDDs and place different files, or different replicas of the same file, on the specified media, so that metadata versus data and frontend versus backend workloads each get the right balance of capacity, performance, and cost. Allocating the resources of one large pool across multiple businesses also improves overall resource utilization.
  2. EC: HDFS gained EC support only in the recent version 3.0, whereas Pangu 1.0 already supported backend EC and Pangu 2.0 adds frontend EC. Compared with multi-replica storage, EC greatly reduces I/O and network traffic, and in some scenarios it also lowers cost and increases throughput (a rough cost comparison follows below).
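As a rough illustration of the cost argument, the snippet below compares the raw-storage overhead of classic 3-replica storage with a Reed-Solomon RS(8,3) erasure code. The (8, 3) layout is only a generic example for the calculation, not necessarily the parameters Pangu uses in production.

```java
/** Illustrative comparison of storage overhead: 3 replicas vs. RS(8,3) erasure coding. */
public class EcOverhead {
    public static void main(String[] args) {
        int replicas = 3;                      // classic triplication
        int dataBlocks = 8, parityBlocks = 3;  // example Reed-Solomon layout, not Pangu's actual parameters

        double replicaOverhead = replicas;                                     // 3.000x raw bytes per logical byte
        double ecOverhead = (dataBlocks + parityBlocks) / (double) dataBlocks; // 1.375x

        System.out.printf("3-replica overhead: %.3fx%n", replicaOverhead);
        System.out.printf("RS(%d,%d) overhead: %.3fx%n", dataBlocks, parityBlocks, ecOverhead);
        System.out.printf("Raw capacity saved: %.0f%%%n",
                100.0 * (replicaOverhead - ecOverhead) / replicaOverhead);     // ~54%
    }
}
```

With these example parameters, EC stores about 1.375 bytes of raw capacity per logical byte instead of 3, roughly a 54% saving, which is the kind of reduction that makes frontend EC attractive despite its extra encoding work.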

DFS

  1. Hadoop fs commands
  2. MapReduce without YARN + DFS
  3. MapReduce with YARN + DFS
  4. Hive without YARN + DFS
  5. Hive with YARN + DFS
  6. Spark without YARN + DFS
  7. Spark with YARN + DFS
  8. TPC-DS test SparkSQL + DFS
  9. TPC-DS test Impala + DFS
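All of the scenarios above drive DFS through the standard Hadoop client stack, so they can be exercised with ordinary FileSystem code. The sketch below assumes an HDFS-compatible endpoint; the URI and paths are placeholders rather than a real cluster address.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal Hadoop FileSystem smoke test; the endpoint URI is a placeholder. */
public class DfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder endpoint; in a real deployment fs.defaultFS in core-site.xml
        // would point at the DFS service instead of a hard-coded URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path dir = new Path("/tmp/dfs-smoke");
        fs.mkdirs(dir);

        // Write a small file, then list the directory and report sizes.
        try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"), true)) {
            out.writeUTF("hello pangu");
        }
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

These are the same operations that `hadoop fs -mkdir`, `-put`, and `-ls` issue through the FileSystem API, so passing this kind of smoke test is a reasonable precondition for the MapReduce, Hive, and Spark runs listed above.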

Future Prospects
