Apache Iceberg 0.11.0: Features and Deep Integration with Flink

  • The Flink streaming reader is supported, allowing users to incrementally consume newly committed data from an Apache Iceberg table through Flink stream processing. Apache Flink is the first computing engine to implement unified streaming and batch reads and writes for a stream-batch unified storage layer such as Apache Iceberg, marking a new chapter in building a stream-batch unified data lake architecture with Apache Flink and Apache Iceberg.
  • Limit pushdown and filter pushdown are implemented for both the Flink streaming and batch readers.
  • CDC and upsert events can be written into Apache Iceberg through the Flink computing engine, with correctness validated on medium-scale data.
  • write.distribution-mode=hash is supported when writing data through the Flink Iceberg sink, which greatly reduces the number of small files generated at the source.
  • The Spark 3 SQL extensions MERGE INTO, DELETE FROM, ALTER TABLE … ADD/DROP PARTITION FIELD, and ALTER TABLE … WRITE ORDERED BY are supported (see the sketch after this list).
  • More data management operations are available through CALL stored procedures, such as merging small files and deleting expired files.
  • An AWS module is introduced for integration with cloud services such as Amazon S3 and the AWS Glue Catalog; and
  • Integration with the popular open-source catalog service Nessie is added.
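A minimal sketch of the MERGE INTO and DELETE FROM extensions listed above, as they would run in Spark SQL; the table and column names (db.target, db.updates, id, count) are hypothetical:

-- Upsert rows from db.updates into db.target, matching on id.
MERGE INTO db.target t
USING db.updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.count = s.count
WHEN NOT MATCHED THEN INSERT *;

-- Row-level delete against an Iceberg table.
DELETE FROM db.target WHERE id = 1;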

Apache Flink Streaming Read

-- Submit the Flink job in streaming mode for the current session.
SET execution.type = streaming;
-- Enable dynamic table options so that job options can be passed in through Flink SQL hints.
SET table.dynamic-table-options.enabled=true;
-- Read all records from the current Iceberg snapshot, and then read incremental data starting from that snapshot.
SELECT * FROM sample /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s') */;
-- Read all incremental data starting from the snapshot-id '3821550127947089987' (records from this snapshot will be excluded).
SELECT * FROM sample /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987') */;
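For comparison, the same table can also be scanned as a bounded batch job; switching the session back to batch mode makes the query read the current snapshot and then terminate:

-- Run the query as a bounded batch job over the current snapshot.
SET execution.type = batch;
SELECT * FROM sample;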

Limit Pushdown and Filter Pushdown of Flink Source

SELECT * FROM sample LIMIT 10;
SELECT * FROM sample WHERE data = 'a';
SELECT * FROM sample WHERE data != 'a';
SELECT * FROM sample WHERE data >= 'a';
SELECT * FROM sample WHERE data <= 'a';
SELECT * FROM sample WHERE data < 'a';
SELECT * FROM sample WHERE data > 'a';
SELECT * FROM sample WHERE data = 'a' AND id = 1;
SELECT * FROM sample WHERE data = 'a' OR id = 1;
SELECT * FROM sample WHERE data IS NULL;
SELECT * FROM sample WHERE NOT (id = 1);
SELECT * FROM sample WHERE data LIKE 'aaa%';
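All of the predicates above can be converted into Iceberg expressions and evaluated against file-level statistics, so non-matching data files are skipped entirely. As an illustration (an assumption about the connector's behavior, not a statement from the release notes), a filter wrapped in a function call generally cannot be converted and is instead evaluated by Flink after the scan:

-- UPPER(data) is not a plain column predicate, so this filter is likely
-- applied by Flink itself rather than pushed down into the Iceberg scan.
SELECT * FROM sample WHERE UPPER(data) = 'A';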

Support for CDC and Upsert Events
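A minimal sketch of wiring such a pipeline together in Flink SQL, assuming the flink-cdc-connectors MySQL source is on the classpath; all connection parameters below are hypothetical:

-- Hypothetical MySQL CDC source table (requires the flink-connector-mysql-cdc jar).
CREATE TABLE cdc_source (
    id BIGINT,
    data STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'flink',
    'password' = 'flink',
    'database-name' = 'db',
    'table-name' = 'source'
);

-- Continuously apply the change stream (inserts, updates, and deletes) to the Iceberg table.
INSERT INTO sample SELECT id, data FROM cdc_source;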

Support for write.distribution-mode=hash When Writing to Apache Iceberg

CREATE TABLE sample (
    id BIGINT,
    data STRING
) PARTITIONED BY (data) WITH (
    'write.distribution-mode'='hash'
);

Hidden partitions such as bucket cannot be declared in Flink DDL, so a bucket field is added to the partition spec through the Iceberg Java API:

// Add a 32-bucket partition field on the id column; table is a loaded org.apache.iceberg.Table.
table.updateSpec()
    .addField(Expressions.bucket("id", 32))
    .commit();
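With hash distribution enabled, rows are shuffled by their partition key before they reach the writer tasks, so each partition (or bucket) is written by a single task instead of by every parallel writer. A hypothetical insert that benefits from this (source_table is an assumed upstream table):

-- Rows are hash-distributed by the partition key (data) before the writers receive them.
INSERT INTO sample SELECT id, data FROM source_table;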

Summary

  • At Tencent, a large amount of log data is flushed through Flink into Iceberg every day, with tens of TBs of data generated on a daily basis;
  • Netflix imports almost all of its user behavior data into Iceberg through Flink stream computing and stores it in AWS S3. Compared to HDFS, Flink and Iceberg help the company minimize storage costs;
  • Tongcheng Yilong (formed by the merger of TravelGo and eLong) has also explored Flink and Iceberg extensively. It used to store almost all of its analysis data in Hive. Given Hive's weak support for ACID transactions and historical backfill, the company studied Iceberg and found it a very suitable replacement for its Hive storage format. In addition, because Iceberg connects well with the upper-layer computing ecosystem, Hive tables can be switched to Iceberg without changing the historical computing tasks. So far, Tongcheng Yilong has switched dozens of Hive tables to Iceberg tables; and
  • Autohome is one of the companies that has successfully replaced Hive tables with Iceberg tables at scale in a production environment. It is also the first company to run a proof of concept for CDC and upsert data analysis with the community version of Iceberg, and it looks forward to further optimization of CDC and upsert scenarios in version 0.12.0.

