Flink 1.11: An Engine with Unified SQL Support for Batch and Streaming Data

Alibaba Cloud
Apr 15, 2021

Many data scientists, analysts, and general business intelligence (BI) users rely on interactive Structured Query Language (SQL) queries to explore data. Flink SQL, as a core Flink module, functions as a distributed SQL query engine. Flink SQL allows users to run queries that combine disparate data sources and to write complex SQL queries for data analysis in a single program. Flink SQL achieves high query efficiency through the integration of cost-based optimizers (CBOs), column-oriented storage, and code generation. The fault tolerance mechanism and high scalability of the Flink runtime architecture also allow Flink SQL to easily process large amounts of data.

In addition to excellent performance, enhanced usability is one of the highlights of Flink SQL in Flink 1.11. To provide a better user experience, the Flink community made the following improvements:

  • It is now easier for users to append or update table definitions.
  • Dynamic query parameters can be declared in a more flexible manner.
  • The Table API interfaces that work with TableEnvironment have been optimized and unified.
  • Connector properties have been simplified.
  • Native support is provided for Hive Data Definition Language (DDL) commands.
  • Support for Python user-defined functions (UDFs) has been improved.

These features are introduced in the following sections.

Create Table Like

Table definitions must be updated in many production scenarios. For example, many users want to add watermark definitions to tables defined externally, such as in the Hive metastore. Similarly, when data from multiple tables is merged into a destination table during an extract, transform, load (ETL) process, the schema of the destination table is the union of the schemas of all upstream tables. In such cases, a convenient way to merge and extend existing definitions is needed.

Starting with Flink 1.11, you can use the LIKE syntax to easily append new definitions to existing table definitions.

The following code demonstrates an example of adding a watermark definition to base_table:

CREATE [TEMPORARY] TABLE base_table (
id BIGINT,
name STRING,
tstmp TIMESTAMP,
PRIMARY KEY(id)
) WITH (
'connector' = 'kafka'
);

CREATE [TEMPORARY] TABLE derived_table (
WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
)
LIKE base_table;

The derived_table definition is equivalent to the following definition:

CREATE [TEMPORARY] TABLE derived_table (
id BIGINT,
name STRING,
tstmp TIMESTAMP,
PRIMARY KEY(id),
WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka'
);

With the LIKE syntax, there is no need to repeat the full schema; users simply declare the attributes they want to add.

Multi-Attribute Strategy

Some may ask: can new attributes only be appended to an existing definition, or is there also a way to overwrite or exclude attributes? The answer is yes. The LIKE syntax provides flexible strategies for handling table attributes.

The LIKE syntax supports the use of different keywords to categorize table attributes:

  • ALL: complete table definitions
  • CONSTRAINTS: constraints such as primary keys and unique keys
  • GENERATED: mainly computed columns and watermarks
  • OPTIONS: table options defined in WITH clauses
  • PARTITIONS: partition information

Additionally, the following strategies can be applied to different attributes:

  • INCLUDING (default)
  • EXCLUDING
  • OVERWRITING

The supported combinations of strategies and attribute categories are listed in the CREATE TABLE documentation linked at the end of this section.

Consider the following example statement:

CREATE [TEMPORARY] TABLE base_table (
id BIGINT,
name STRING,
tstmp TIMESTAMP,
PRIMARY KEY(id)
) WITH (
'connector' = 'kafka',
'scan.startup.specific-offsets' = 'partition:0,offset:42;partition:1,offset:300',
'format' = 'json'
);

CREATE [TEMPORARY] TABLE derived_table (
WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
)
WITH (
'connector.starting-offset' = '0'
)
LIKE base_table (OVERWRITING OPTIONS, EXCLUDING CONSTRAINTS);

The resulting table in the preceding example will be equivalent to a table created with the following statement:

CREATE [TEMPORARY] TABLE derived_table (
id BIGINT,
name STRING,
tstmp TIMESTAMP,
WATERMARK FOR tstmp AS tstmp - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'scan.startup.specific-offsets' = 'partition:0,offset:42;partition:1,offset:300',
'connector.starting-offset' = '0',
'format' = 'json'
);

For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table

Dynamic Table Options

In production, it is a common requirement to adjust table options temporarily, for example for an ad-hoc query submitted from the terminal. Take the following Kafka table as an example:

create table kafka_table (
id bigint,
age int,
name STRING
) WITH (
'connector' = 'kafka',
'topic' = 'employees',
'scan.startup.mode' = 'timestamp',
'scan.startup.timestamp-millis' = '123456',
'format' = 'csv',
'csv.ignore-parse-errors' = 'false'
)

Now consider making the following adjustments:

  • Modify the scan.startup.timestamp-millis option to start reading from a different epoch timestamp.
  • Modify the csv.ignore-parse-errors option to skip records that cannot be parsed.

In previous versions of Flink, the only way to change such options was to modify the table definition itself, for example with an ALTER TABLE statement. Starting from Flink 1.11, however, you can specify, override, or dynamically add the table options defined in WITH clauses by using dynamic table options.

The syntax is as follows:

table_name /*+ OPTIONS('k1'='v1', 'aa.bb.cc'='v2') */

The key/value pairs specified in the OPTIONS hint override the corresponding options of the original table definition, and the hint can be attached to table references in any kind of SQL statement. The following code is an example:

CREATE TABLE kafka_table1 (id BIGINT, name STRING, age INT) WITH (...) ;
CREATE TABLE kafka_table2 (id BIGINT, name STRING, age INT) WITH (...) ;
-- override table options in query source
select id, name from kafka_table1 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */;
-- override table options in join
select * from
kafka_table1 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ t1
join
kafka_table2 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */ t2
on t1.id = t2.id;
-- override table options for INSERT target table
insert into kafka_table1 /*+ OPTIONS('sink.partitioner'='round-robin') */ select * from kafka_table2;

There is no context restriction on dynamic table options: they can be attached to any table reference in a statement. When appended to a table, the specified key/value pairs are merged into the original table definition for that statement only, which makes temporary adjustments very convenient.

Dynamic table options are disabled by default because they may change the semantics of the query. You can enable this function by setting the config option:

// instantiate table environment
TableEnvironment tEnv = ...
// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.dynamic-table-options.enabled", "true");

For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/hints.html

SQL API Improvements

As Flink SQL evolves, it keeps providing more comprehensive support for more statements, which now leads to problems with older APIs:

  • The DDL statements passed to the sqlUpdate() method are executed immediately while the INSERT INTO statement is executed only when the execute() method is called.
  • The trigger points of Flink table programs are not clear. Both TableEnvironment.execute() and StreamExecutionEnvironment.execute() can trigger a Flink table program execution.
  • The SQL execution methods do not return results, so the output of statements such as SHOW TABLES cannot be obtained. In addition, sqlUpdate() behaves inconsistently depending on the kind of statement it executes.
  • Blink planner provides the ability to optimize multiple sinks, but the TableEnvironment API does not provide a clear mechanism to control the whole flow.

In Flink 1.11, the SQL interfaces provided by TableEnvironment have been reorganized to offer consistent execution semantics, access to the values returned by SQL execution, and support for multiline statements.

Clearer Execution Semantics

The new TableEnvironment#executeSql method returns a TableResult, which can be iterated to obtain the execution results. The structure of the returned data depends on the executed statement: for example, a SELECT statement returns the query results, while an INSERT statement submits the job to the cluster asynchronously.
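
The following sketch illustrates this behavior; it assumes an already configured TableEnvironment named tEnv, and the table names Orders and OrdersCopy are placeholders for this example:

// SHOW statements return their results directly through the TableResult
TableResult tables = tEnv.executeSql("SHOW TABLES");
tables.print();

// a SELECT also returns a TableResult; rows can be iterated lazily
try (CloseableIterator<Row> it = tEnv.executeSql("SELECT * FROM Orders").collect()) {
    while (it.hasNext()) {
        Row row = it.next();
        // process the row
    }
}

// an INSERT submits the job asynchronously; the TableResult exposes the JobClient
TableResult insertResult = tEnv.executeSql("INSERT INTO OrdersCopy SELECT * FROM Orders");
insertResult.getJobClient().ifPresent(client -> System.out.println(client.getJobID()));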

Simultaneous Execution of Multiple Statements

The new TableEnvironment#createStatementSet interface allows you to add multiple INSERT statements and execute them together. In multi-sink scenarios, the Blink planner optimizes the execution plans in a targeted manner.
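
A minimal sketch of the statement set API follows; tEnv is assumed to be an existing TableEnvironment, and the table names sink1, sink2, and source are placeholders:

StatementSet stmtSet = tEnv.createStatementSet();
stmtSet.addInsertSql("INSERT INTO sink1 SELECT id, name FROM source");
stmtSet.addInsertSql("INSERT INTO sink2 SELECT id, COUNT(*) FROM source GROUP BY id");
// all statements in the set are optimized together and submitted as a single job
TableResult result = stmtSet.execute();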

Comparison Between the Old and New APIs

The main change is that calls such as sqlUpdate() and insertInto() followed by execute() are replaced by executeSql() and statement sets; a full side-by-side comparison of the old and new APIs is available in FLIP-84, linked below.

For more information, visit https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=134745878

Enhanced Compatibility with the Hive Syntax

Starting from version 1.11, Flink allows users to write SQL statements in Hive syntax when the Hive dialect is enabled. The Hive dialect currently supports DDL statements for databases, tables, views, and functions. You can use the Hive dialect together with HiveCatalog to write SQL statements in Hive syntax.

Before you use the Hive dialect, you must set the dialect:

EnvironmentSettings settings = EnvironmentSettings.newInstance()...build();
TableEnvironment tableEnv = TableEnvironment.create(settings);
// to use hive dialect
tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
// use the hive catalog
tableEnv.registerCatalog(hiveCatalog.getName(), hiveCatalog);
tableEnv.useCatalog(hiveCatalog.getName());

Then, you can execute DDL statements in Hive syntax, such as the table creation statement:

create external table tbl1 (
d decimal(10,0),
ts timestamp)
partitioned by (p string)
location '%s'
tblproperties('k1'='v1');

create table tbl2 (s struct<ts:timestamp,bin:binary>) stored as orc;
create table tbl3 (
m map<timestamp,binary>
)
partitioned by (p1 bigint, p2 tinyint)
row format serde 'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe';
create table tbl4 (
x int,
y smallint)
row format delimited fields terminated by '|' lines terminated by '\n';

The Flink community is also planning to provide Hive syntax compatibility for data query language (DQL) statements. More syntax compatibility will be featured in Flink 1.12.

For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/hive_dialect.html

Simplified Connector Properties

In Flink 1.11, connector properties are simplified and new property keys are introduced, which are more concise and understandable than the old keys. The major changes are as follows:

  • connector.type is renamed connector, and connector.version is folded into its value. For example, kafka-0.11 denotes the Kafka connector for version 0.11.
  • The connector. prefix is removed from the remaining options.
  • The scan and sink prefixes are used to distinguish source options from sink options.
  • format.type is renamed format, and the format value is used as the prefix for format-specific properties. For example, csv is the prefix for the properties of the comma-separated values (CSV) format.

The following example shows how to create a Kafka table:

CREATE TABLE kafkaTable (
user_id BIGINT,
item_id BIGINT,
category_id BIGINT,
behavior STRING,
ts TIMESTAMP(3)
) WITH (
'connector' = 'kafka',
'topic' = 'user_behavior',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'format' = 'csv',
'scan.startup.mode' = 'earliest-offset'
)

For more information, visit https://cwiki.apache.org/confluence/display/FLINK/FLIP-122%3A+New+Connector+Property+Keys+for+New+Factory

JDBC Catalog

In earlier versions of Flink, users could create mirror tables for relational databases only by defining each table explicitly, which required them to track schema changes of the source tables and databases in Flink SQL. Flink 1.11 introduces the JdbcCatalog interface, which enables users to connect Flink to relational databases such as Postgres, MySQL, MariaDB, and Amazon Aurora.

Currently, PostgresCatalog is the only implementation of Java Database Connectivity (JDBC) Catalog, which is configured as follows:

CREATE CATALOG mypg WITH(
'type' = 'jdbc',
'default-database' = '...',
'username' = '...',
'password' = '...',
'base-url' = '...'
);
USE CATALOG mypg;

You can also implement the JdbcCatalog interface to connect Flink to other relational databases and provide additional JDBC catalog implementations.
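
The catalog can also be registered programmatically through the Table API. The following is a minimal sketch, assuming the flink-connector-jdbc and PostgreSQL driver dependencies are on the classpath, that tEnv is an existing TableEnvironment, and that the connection parameters shown are placeholders to be filled in:

// connection parameters are placeholders
String name            = "mypg";
String defaultDatabase = "...";
String username        = "...";
String password        = "...";
String baseUrl         = "jdbc:postgresql://<host>:<port>";

JdbcCatalog catalog = new JdbcCatalog(name, defaultDatabase, username, password, baseUrl);
tEnv.registerCatalog("mypg", catalog);
tEnv.useCatalog("mypg");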

For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/connectors/jdbc.html#postgres-database-as-a-catalog

Enhanced Support for Python UDFs

Flink 1.11 features enhanced support for Python UDFs in PyFlink, including defining Python UDFs with SQL DDL and SQL-CLI, vectorized Python UDFs, and user-defined metrics.

Defining Python UDFs with SQL DDL

Support for Python UDFs was first introduced in Flink 1.10.0, but it was limited to the Python Table API. Flink 1.11 adds support for registering Python UDFs through SQL DDL, so they can now also be used from the Java/Scala Table API and from the SQL CLI.

For example, the following example calls a Python UDF in a Java application:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
// ship the Python file that contains the UDF and choose the Python interpreter
tEnv.getConfig().getConfiguration().setString("python.files", "/home/my/test1.py");
tEnv.getConfig().getConfiguration().setString("python.client.executable", "python3");
// register the Python function test1.func1 via SQL DDL
tEnv.sqlUpdate("create temporary system function func1 as 'test1.func1' language python");
// call the Python UDF from the Java Table API
Table table = tEnv.fromDataSet(env.fromElements("1", "2", "3")).as("str").select("func1(str)");
tEnv.toDataSet(table, String.class).collect();

Vectorized Python UDFs

The performance of vectorized UDFs is much better than that of UDFs that only support scalar values of standard Python types. Vectorized Python UDFs also allow you to take full advantage of popular Python libraries such as Pandas and NumPy. To mark a UDF as a Pandas UDF, you simply need to add an extra parameter udf_type="pandas" in the UDF decorator.

The following example shows how to define a vectorized Python scalar function and use it in a query:

from pyflink.table import BatchTableEnvironment, DataTypes
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT(), udf_type="pandas")
def add(i, j):
    return i + j

# env is an existing ExecutionEnvironment; my_table is an existing Table
table_env = BatchTableEnvironment.create(env)

# register the vectorized Python scalar function
table_env.register_function("add", add)

# use the vectorized Python scalar function in Python Table API
my_table.select("add(bigint, bigint)")

# use the vectorized Python scalar function in SQL API
table_env.sql_query("SELECT add(bigint, bigint) FROM MyTable")

For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/python/vectorized_python_udfs.html

Flink 1.11 also provides a comprehensive metric system for Python UDFs that allows users to access and define user scopes and variables.

For more information, visit https://ci.apache.org/projects/flink/flink-docs-master/dev/table/python/metrics.html

Going Forward

The Flink community will focus on improving user experience in the versions to come, such as optimizing the usability of schemas, simplifying the Descriptor API, and enhancing support for Stream DDL statements. Stay tuned for more updates and features!

