Using Hive in Apache Flink 1.9

Overview — Flink on Hive

Design Architecture

1 Metadata

2 Table Data

Project Progress

  • Apache Flink provides simple Data Definition Language (DDL) statements to read Hive metadata, such as show databases, show tables, and describe tables.
  • Allows using the Catalog API to modify Hive metadata, such as create table and drop table.
  • Allows reading Hive data in both partition tables and non-partition tables.
  • Supports writing Hive data in non-partitioned tables.
  • Supports file formats, such as Text, ORC, Parquet, and SequenceFile
  • Allows to call UDFs created in Hive
  • INSERT and OVERWRITE are not supported.
  • Writing data to the partition table is not supported.
  • ACID tables are not supported.
  • Bucket tables are not supported.
  • Data view is not supported.

Application

1 Add Dependencies

2 Configure HiveCatalog

2.1 SQL Client

- name: myhive
type: hive
hive-conf-dir: /path/to/hive_conf_dir
hive-version: 2.3.4
Flink SQL> show catalogs;
default_catalog
myhive
Flink SQL> use catalog myhive;

2.2 Table API

String name = "myhive";
String defaultDatabase = "default";
String hiveConfDir = "/path/to/hive_conf_dir";
String version = "2.3.4";
TableEnvironment tableEnv = ⋯; // create TableEnvironment
HiveCatalog hiveCatalog = new HiveCatalog(name, defaultDatabase,
hiveConfDir, version);
tableEnv.registerCatalog(name, hiveCatalog);
tableEnv.useCatalog(name);

3 Read and Write Hive Tables

3.1 SQL Client

Flink SQL> describe src;
root
|-- key: STRING
|-- value: STRING
Flink SQL> select * from src; key value
100 val_100
298 val_298
9 val_9
341 val_341
498 val_498
146 val_146
458 val_458
362 val_362
186 val_186
⋯⋯ ⋯⋯
Flink SQL> insert into src values ('newKey','newVal');

3.2 Table API

TableEnvironment tableEnv = ⋯; // create TableEnvironment
tableEnv.registerCatalog("myhive", hiveCatalog);
// set myhive as current catalog
tableEnv.useCatalog("myhive");
Table src = tableEnv.sqlQuery("select * from src");
// write src into a sink or do further analysis
⋯⋯
tableEnv.sqlUpdate("insert into src values ('newKey', 'newVal')");
tableEnv.execute("insert into src");

4 Support for Different Hive Versions

5 Selection of Execution Mode and Planner

# select the implementation responsible for planning table programs
# possible values are 'old' (used by default) or 'blink'
planner: blink
# 'batch' or 'streaming' execution
type: batch
EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
TableEnvironment tableEnv = TableEnvironment.create(settings);

Future Planning

  • Adding support for more data types
  • Adding support for writing data to partition tables, including static and dynamic partitions
  • Adding support for INSERT and OVERWRITE
  • Support for View
  • Providing better support for DDL and DML statements
  • Adding support for TableSink in streaming mode to allow users to conveniently write stream data to Hive
  • Testing and supporting more Hive versions
  • Adding support for Bucket tables
  • Adding support for performance testing and optimization

Original Source:

--

--

--

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Greetings #KVTcommunity

Welcome to our fourth article of this series.

Anonymous functions in python (Lambda)

Setup IPSec Tunnel between Microsoft Azure and Alibaba Cloud with VPN Gateway

Synchronous vs. Asynchronous Replication Strategy

Sharing a Folder To a Kali Linux VM in VMware Fusion

Switched to Windows from Linux

Null is your friend, not a mistake

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

More from Medium

Dynamic SQL processing with Apache Flink

Undatum: command-line JSON lines/BSON data processing tool

How can I manage my Kafka Artefacts?

Databricks Workspace SSO: Integration with Keycloak and SAML 2.0

Databricks admin console single sign-on form