Using Hive in Apache Flink 1.9

Overview — Flink on Hive

Design Architecture

1 Metadata

2 Table Data

Project Progress

  • Apache Flink provides simple Data Definition Language (DDL) statements to read Hive metadata, such as show databases, show tables, and describe tables.
  • Allows using the Catalog API to modify Hive metadata, such as create table and drop table.
  • Allows reading Hive data in both partition tables and non-partition tables.
  • Supports writing Hive data in non-partitioned tables.
  • Supports file formats, such as Text, ORC, Parquet, and SequenceFile
  • Allows to call UDFs created in Hive
  • INSERT and OVERWRITE are not supported.
  • Writing data to the partition table is not supported.
  • ACID tables are not supported.
  • Bucket tables are not supported.
  • Data view is not supported.

Application

1 Add Dependencies

2 Configure HiveCatalog

2.1 SQL Client

- name: myhive
type: hive
hive-conf-dir: /path/to/hive_conf_dir
hive-version: 2.3.4
Flink SQL> show catalogs;
default_catalog
myhive
Flink SQL> use catalog myhive;

2.2 Table API

String name = "myhive";
String defaultDatabase = "default";
String hiveConfDir = "/path/to/hive_conf_dir";
String version = "2.3.4";
TableEnvironment tableEnv = ⋯; // create TableEnvironment
HiveCatalog hiveCatalog = new HiveCatalog(name, defaultDatabase,
hiveConfDir, version);
tableEnv.registerCatalog(name, hiveCatalog);
tableEnv.useCatalog(name);

3 Read and Write Hive Tables

3.1 SQL Client

Flink SQL> describe src;
root
|-- key: STRING
|-- value: STRING
Flink SQL> select * from src; key value
100 val_100
298 val_298
9 val_9
341 val_341
498 val_498
146 val_146
458 val_458
362 val_362
186 val_186
⋯⋯ ⋯⋯
Flink SQL> insert into src values ('newKey','newVal');

3.2 Table API

TableEnvironment tableEnv = ⋯; // create TableEnvironment
tableEnv.registerCatalog("myhive", hiveCatalog);
// set myhive as current catalog
tableEnv.useCatalog("myhive");
Table src = tableEnv.sqlQuery("select * from src");
// write src into a sink or do further analysis
⋯⋯
tableEnv.sqlUpdate("insert into src values ('newKey', 'newVal')");
tableEnv.execute("insert into src");

4 Support for Different Hive Versions

5 Selection of Execution Mode and Planner

# select the implementation responsible for planning table programs
# possible values are 'old' (used by default) or 'blink'
planner: blink
# 'batch' or 'streaming' execution
type: batch
EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
TableEnvironment tableEnv = TableEnvironment.create(settings);

Future Planning

  • Adding support for more data types
  • Adding support for writing data to partition tables, including static and dynamic partitions
  • Adding support for INSERT and OVERWRITE
  • Support for View
  • Providing better support for DDL and DML statements
  • Adding support for TableSink in streaming mode to allow users to conveniently write stream data to Hive
  • Testing and supporting more Hive versions
  • Adding support for Bucket tables
  • Adding support for performance testing and optimization

Original Source:

--

--

--

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Best Practices: Android Frameworks For Faster App Development

android frameworks

What’s The Point Of Two Pointers?

Lyft’s Large-scale Flink-based Near Real-time Data Analytics Platform

How to Create Drupal Role with Ansible Playbook

CS371p Spring 2021: Luca Chaves Rodrigues Noronha dos Santos

Update of the situation:

Exploring Aliyun Linux — A Linux Distro Made in the Cloud for the Cloud

Python’s Attribute Lookup

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

More from Medium

GCP — Execute Jar on Databricks from Airflow — Big Data Processing

Apache DolphinScheduler Quick-start | Video Tutorial

DevOps in Data: The Volcanic Waltz

A volcano erupting

Integrating Apache Pulsar with BigQuery