Building an Enterprise-Level Real-Time Data Lake Based on Flink and Iceberg

Background and Introduction to Data Lake

  1. Storage of raw data from a variety of sources
  2. Support for multiple computing models
  3. Comprehensive data management capabilities: a wide range of data sources can be accessed and connected to one another, with support for schema management and permission management.
  4. Flexible underlying storage: cost-effective distributed file systems such as S3, OSS, and HDFS are used, and specific file formats and caches are applied to meet the data analysis requirements of each scenario.

A typical data lake architecture consists of four layers, from bottom to top:

  1. The bottom layer is the distributed file system. Cloud users tend to choose S3 or OSS because they are much cheaper, while non-cloud users generally run self-maintained HDFS.
  2. The second layer is the data acceleration layer. The data lake architecture fully separates storage from compute. If every data access reads remotely from the file system, the performance overhead and cost will be high. If frequently accessed hot data can be cached locally on the compute nodes, hot and cold data are separated naturally: local reads are fast, and bandwidth for remote access is saved. Open-source Alluxio or Alibaba Cloud JindoFS is often chosen for this layer.
  3. The third layer is the Table Format layer. It wraps a batch of data files into a business table and provides table-level semantics such as ACID, snapshots, schema, and partitioning. Open-source projects such as Delta, Iceberg, and Hudi are generally used here. Some users consider Delta, Iceberg, and Hudi to be data lakes in themselves, but these projects are only one part of the data lake architecture. The misunderstanding arises because they are the layer closest to users and hide many of the underlying details.
  4. The top layer consists of the computing engines for different computing scenarios. Open-source engines include Spark, Flink, Hive, Presto, Hive MR, and others. These engines can access tables in the same data lake at the same time.
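
The four layers above can be illustrated with a small in-memory sketch. It is a toy model under stated assumptions, not how Iceberg, Alluxio, or JindoFS are implemented: the class names (ObjectStore, CachedStore, Table), the file paths, and the cache capacity are all hypothetical and chosen only for illustration. It shows a read-through cache saving remote reads, and a table defined as a list of snapshots over immutable data files.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


class ObjectStore:
    """Layer 1 (hypothetical stand-in for S3/OSS/HDFS); counts remote reads."""

    def __init__(self):
        self._blobs = {}
        self.remote_reads = 0

    def put(self, path, rows):
        self._blobs[path] = tuple(rows)  # data files are never modified in place

    def get(self, path):
        self.remote_reads += 1
        return self._blobs[path]


class CachedStore:
    """Layer 2: read-through LRU cache kept on the compute node."""

    def __init__(self, store, capacity=2):
        self.store = store
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)    # mark as most recently used
            return self._cache[path]
        rows = self.store.get(path)          # cache miss: one remote read
        self._cache[path] = rows
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the coldest file
        return rows


@dataclass
class Table:
    """Layer 3: a table is a list of snapshots, each a tuple of file paths."""

    store: ObjectStore
    snapshots: list = field(default_factory=lambda: [()])  # snapshot 0 is empty

    def append(self, path, rows):
        self.store.put(path, rows)                            # write the file first
        self.snapshots.append(self.snapshots[-1] + (path,))   # then commit it

    def scan(self, reader, snapshot_id=-1):
        # Layer 4 (an engine) reads whichever snapshot it asks for.
        return [row for path in self.snapshots[snapshot_id]
                for row in reader.get(path)]


store = ObjectStore()
table = Table(store)
table.append("data/f1.parquet", [1, 2])
table.append("data/f2.parquet", [3])

reader = CachedStore(store)
latest = table.scan(reader)                 # reads both files remotely
again = table.scan(reader)                  # served from the local cache
first = table.scan(reader, snapshot_id=1)   # reads the table as of the first commit
```

In this sketch a reader never sees a half-written file, because a file only becomes visible once a new snapshot that lists it has been appended. Real table formats persist this metadata durably and handle concurrent writers, which the sketch does not attempt.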

Introduction to Typical Service Scenarios

Why Apache Iceberg?

Implementing Streaming Migration to the Data Lake Using Flink and Iceberg

Future Community Planning

About the Author

Original Source:



Alibaba Cloud