MaxCompute Query Optimization with Calcite

In this article, Lei Chunwei, a senior development engineer at Alibaba Cloud, introduces the technology and stories behind MaxCompute and Apache Calcite. The content covers:

1) What is a query optimizer?
2) The specific practice of the MaxCompute query optimizer.
3) The follow-up plan of MaxCompute.
4) His personal growth from newly recruited Alibaba engineer to Apache Calcite committer.

The following content is based on his speech and the related PowerPoint slides.

Query Optimizer Introduction


As we all know, a database generally consists of three parts: the parser, the optimizer, and the execution engine. After an SQL statement enters the database, it is first parsed by the parser into a relational expression; the optimizer then turns that expression into a physical execution plan; finally, the execution engine runs the plan. The optimizer is thus a core component of the database: it converts relational expressions into execution plans and largely determines the performance of the whole system. If we compare a database to a human body, the optimizer is the brain, which determines how far and how fast the system can go.
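The parse-optimize-execute flow can be sketched as a minimal pipeline. This is an illustrative toy, not a real engine: the plan representation, the single no-op-projection "optimization," and the sample table are all invented for demonstration.

```python
# Toy sketch of the three-stage pipeline: parser -> optimizer -> executor.

def parse(sql):
    """Turn a trivial 'SELECT cols FROM table' string into a plan tree."""
    parts = sql.split()
    cols = parts[1].split(",")
    table = parts[3]
    return ("project", cols, ("scan", table))

def optimize(rel):
    """Pretend-optimizer: remove a projection that keeps all columns."""
    if rel[0] == "project" and rel[1] == ["*"]:
        return rel[2]  # projecting every column is redundant
    return rel

def execute(plan, tables):
    """Interpret the plan tree against in-memory tables."""
    if plan[0] == "scan":
        return tables[plan[1]]
    if plan[0] == "project":
        _, cols, child = plan
        return [{c: r[c] for c in cols} for r in execute(child, tables)]

tables = {"emp": [{"name": "Ann", "age": 35}, {"name": "Bob", "age": 28}]}
plan = optimize(parse("SELECT name FROM emp"))
print(execute(plan, tables))
```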

The query optimizer can be divided into two types:

  • Rule-Based Optimizer (RBO): It converts a relational expression into another relational expression according to optimization rules, discarding the original expression, and generates the final execution plan after a series of such conversions. RBO is a relatively rigid optimizer: it only recognizes rules and is insensitive to the data, so it easily falls into plans that are locally optimal but globally suboptimal, and the plan it produces may also depend on the order in which rules are applied.
  • Cost-Based Optimizer (CBO): It converts relational expressions according to optimization rules while preserving the original expressions. After a series of conversions it obtains multiple candidate execution plans, computes the cost of each, and selects the one with the lowest cost as the final plan. The advantage of CBO is its flexibility and intelligence: it can generate execution plans tailored to the characteristics of the data. For this reason, most major databases and computing engines now use CBO. For example, Oracle completely abandoned RBO in favor of CBO starting from Oracle 10g, and Hive introduced CBO in version 0.14. MaxCompute has likewise moved from RBO to CBO, one of the major changes from MaxCompute 1.0 to 2.0.
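The contrast between the two styles can be sketched on a toy plan space. Everything here is invented for illustration: the single rewrite rule, the plan tuples, and the cost numbers; the point is only that RBO discards the original plan after a rewrite, while CBO keeps all candidates and picks the cheapest.

```python
# RBO vs. CBO on a toy plan space (all plans, rules, costs are made up).

RULES = [
    # each rule: (name, matcher, rewriter)
    ("push_filter_below_join",
     lambda p: p == ("filter", ("join", "emp", "dept")),
     lambda p: ("join", ("filter", "emp"), "dept")),
]

def rbo(plan):
    """Apply each matching rule, discarding the original expression."""
    for _, matches, rewrite in RULES:
        if matches(plan):
            plan = rewrite(plan)  # the original plan is thrown away
    return plan

def cbo(plan, cost):
    """Keep the original and all rewritten plans; return the cheapest."""
    candidates = [plan]
    for _, matches, rewrite in RULES:
        if matches(plan):
            candidates.append(rewrite(plan))
    return min(candidates, key=cost)

# Toy cost table: filtering below the join happens to be cheaper here.
COSTS = {
    ("filter", ("join", "emp", "dept")): 100,
    ("join", ("filter", "emp"), "dept"): 40,
}
plan = ("filter", ("join", "emp", "dept"))
print(rbo(plan))
print(cbo(plan, COSTS.get))
```

With different cost numbers, CBO could keep the original plan while RBO would still blindly rewrite it; that data sensitivity is the difference described above.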

CBO mainly has two different styles of frameworks: One is the System-R style, and the other is the Volcano/Cascade style.

  • System-R: After an SQL statement is processed by the parser and the resolver, the query is rewritten first; the optimizer then generates the final physical execution plan through a dynamic programming algorithm, and the plan is submitted to the executor for execution. This style of CBO was first used in IBM's databases and later adopted by Oracle and PostgreSQL.
  • Volcano/Cascade: The difference from System-R lies in the search direction: the physical optimizer of Volcano/Cascade works top-down, while that of System-R works bottom-up. Volcano/Cascade generates a better physical execution plan based on a rule system (logical rules + physical rules). This style is currently used by SQL Server, Apache Calcite, and Pivotal.
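The bottom-up System-R approach can be sketched as Selinger-style dynamic programming over join orders. The table sizes, the cost formula, and the fixed 1% join selectivity below are all invented for illustration; only left-deep plans are enumerated to keep the sketch short.

```python
from itertools import combinations

# Bottom-up (System-R style) dynamic programming over left-deep join
# orders. Sizes, costs, and selectivity are made-up toy numbers.

SIZES = {"emp": 1000, "dept": 50, "loc": 10}

def rows(plan):
    """Estimated output rows of a (sub)plan; assume 1% join selectivity."""
    if isinstance(plan, str):
        return SIZES[plan]
    left, right = plan
    return max(1, rows(left) * rows(right) // 100)

def best_join_order(tables):
    """Return (cost, plan) of the cheapest left-deep join order."""
    # best[subset] = (cost, plan); base scans cost nothing in this toy
    best = {frozenset([t]): (0, t) for t in tables}
    for k in range(2, len(tables) + 1):
        for subset in combinations(tables, k):
            key = frozenset(subset)
            for right in sorted(subset):          # try each table last
                rest = key - {right}
                sub_cost, sub_plan = best[rest]
                # toy cost: input cost + product of input cardinalities
                cost = sub_cost + rows(sub_plan) * SIZES[right]
                if key not in best or cost < best[key][0]:
                    best[key] = (cost, (sub_plan, right))
    return best[frozenset(tables)]

cost, plan = best_join_order(["emp", "dept", "loc"])
print(cost, plan)  # joining the two small tables first wins
```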
[Figure]

Apache Calcite is an SQL optimization engine that is independent of storage and execution, and it is widely used in open-source big data engines such as Flink, Drill, and Kylin. Calcite's goal is to be "a solution for all demand scenarios," providing a unified query engine for different computing platforms and data sources. As shown in the preceding figure, the Calcite architecture includes neither storage nor execution. The overall flow is as follows: an SQL statement enters the JDBC Server through the JDBC Client and, after being processed by the Parser and Validator, becomes a relational algebra expression (operator expression). Alternatively, users can build the relational algebra expression directly with the Expressions Builder, which suits systems with their own specific syntax. The expression then enters the Query Optimizer core engine, which runs the planning algorithms and cost computation. Calcite also provides pluggable features such as Metadata Providers (supplying metadata to the optimization engine) and Pluggable Rules (allowing users to define their own optimization rules).


Calcite mainly provides two optimizer implementations:

  • HepPlanner: the RBO implementation mentioned above. It is a heuristic optimizer that keeps matching rules until the preset match limit is reached or an iteration produces no further matches;
  • VolcanoPlanner: the CBO implementation mentioned above. It applies rules iteratively until the plan with the lowest cost is found.
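A HepPlanner-style loop can be sketched as repeated rule firing with a match limit. This is a conceptual toy, not Calcite's API: the plan tuples and the two rewrite rules are invented for demonstration.

```python
# Heuristic (HepPlanner-style) loop: fire rules until nothing matches
# or a preset match limit is hit. Plans and rules are illustrative.

def hep_optimize(plan, rules, match_limit=100):
    matches = 0
    changed = True
    while changed and matches < match_limit:
        changed = False
        for rule in rules:
            new_plan = rule(plan)
            if new_plan != plan:    # the rule fired
                plan = new_plan     # the original plan is discarded
                matches += 1
                changed = True
    return plan

# Two toy rules: drop always-true filters, merge adjacent filters.
def drop_true_filter(plan):
    if plan[:1] == ("filter",) and plan[1] == "true":
        return plan[2]
    return plan

def merge_filters(plan):
    if plan[:1] == ("filter",) and plan[2][:1] == ("filter",):
        _, p1, (_, p2, child) = plan
        return ("filter", f"({p1}) AND ({p2})", child)
    return plan

plan = ("filter", "true", ("filter", "age > 30", ("scan", "emp")))
print(hep_optimize(plan, [drop_true_filter, merge_filters]))
```

Note that reversing the rule list would merge the filters before the `true` predicate could be dropped, yielding a different plan, which is exactly the rule-order sensitivity attributed to RBO above.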

MaxCompute Query Optimizer Practices

  • Cost model: It computes costs in order to select the optimal execution plan, and a good cost model affects the performance of the entire system. The cost model is a five-tuple: CPU, I/O, row count, memory, and network. Each operator accounts only for its own cost; the engine accumulates the cost of the whole execution plan. The cost model strives to reflect the underlying physical implementation (for example, concrete I/O and network figures), but it does not need to match the actual values exactly; it only needs to rank plans well enough that a better execution plan is chosen.
  • Optimization rules: Different systems have different optimization rules, such as partition pruning rules.
  • Statistics: This part is very important for the optimizer. Lack of or poor statistics may lead to poor performance of the generated execution plan.
  • Query optimization engine: It implements many features on top of Apache Calcite, including partition pruning, join ordering, and join algorithm selection.
  • History-based resource allocation optimization: After the executor runs a task, the execution results are fed back to this module to optimize resource allocation for later runs.
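The five-tuple cost model and its accumulation can be sketched directly. The operator costs below are made-up numbers; the point is that each operator reports only its own (CPU, I/O, row count, memory, network) cost and the engine sums them over the plan tree.

```python
from dataclasses import dataclass

# Sketch of the five-tuple cost model: per-operator costs are invented,
# and the engine accumulates them bottom-up over the plan tree.

@dataclass
class Cost:
    cpu: float = 0.0
    io: float = 0.0
    rows: float = 0.0
    memory: float = 0.0
    network: float = 0.0

    def __add__(self, other):
        return Cost(self.cpu + other.cpu, self.io + other.io,
                    self.rows + other.rows, self.memory + other.memory,
                    self.network + other.network)

def plan_cost(plan):
    """Accumulate each operator's own cost over the whole tree."""
    op_cost, children = plan
    total = op_cost
    for child in children:
        total = total + plan_cost(child)
    return total

# scans -> shuffle -> aggregate, each node = (own cost, child plans)
scan_emp  = (Cost(cpu=10, io=100, rows=1000), [])
scan_dept = (Cost(cpu=1,  io=10,  rows=50),   [])
shuffle   = (Cost(cpu=5, network=200, rows=1050), [scan_emp, scan_dept])
agg       = (Cost(cpu=20, memory=50, rows=10), [shuffle])
print(plan_cost(agg))
```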

The MaxCompute SQL optimizer implements the Shuffle Removal feature on top of the optimization engine described above. Shuffle refers to the process of moving data from an upstream task to a downstream task, including sharding and sorting. If the upstream task's output already has the required physical properties, the shuffle is unnecessary.

[Figure]

For example, to count the average age of employees over 30 years old in departments created after 2018, the physical execution plan shown in the above figure can be obtained by using MaxCompute. First, the employees over 30 years old are filtered out from the employee table “emp”, the departments created after 2018 are filtered out from the department table “dept”, and then the join operation is performed. Considering that the data volume may be large, Merge Join is selected for the operation. Since data needs to be distributed to the same machine, the Shuffle process (EX in the figure) is required from the upstream Filter task to perform the Merge Join task. In addition, the joined data needs to be aggregated (the Agg in the figure) to compute the average age, and the Shuffle process is also required at this time. In fact, since Merge Join is used, the data has been distributed and sorted in this process. Therefore, Shuffle does not need to be repeated in the process from Join to Aggregation. The Shuffle Removal feature can be used to remove repeated Shuffle processes.

The Shuffle Removal feature is implemented with the Enforcer Rule; the specific process is shown in the following figure. Operators such as Join and Aggregate require their input data to carry a physical data property, or Trait (obtained through operations such as sorting and sharding). The Enforcer Rule guarantees the required physical property (Required Trait) of the input data: 1) if the input does not have the Required Trait, a shuffle is added; 2) if the input already has the Required Trait, no shuffle is added; 3) the property can be passed along the input chain, that is, the Required Trait can be pushed down to the operator's own input. Since its launch, the Shuffle Removal feature has saved Alibaba Cloud about 10% of resources.
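The Enforcer-Rule idea behind Shuffle Removal can be sketched as a trait check: an exchange (shuffle) is inserted only when the input does not already deliver the required physical trait. The dictionary plan representation and all names below are illustrative, not MaxCompute internals.

```python
# Sketch of trait enforcement: add a shuffle only when the required
# physical trait is missing from the input. Names are invented.

def enforce(required_trait, input_op):
    """Return input_op, wrapping it in a shuffle only if needed."""
    if required_trait == input_op.get("delivers"):
        return input_op                 # trait satisfied: no shuffle
    return {"op": "shuffle", "delivers": required_trait,
            "input": input_op}          # trait missing: add exchange

# Merge join requires data sharded and sorted on the join key.
trait = ("hash+sort", "dept_no")

filter_emp = {"op": "filter", "delivers": None}
join = {"op": "merge_join",
        "input": enforce(trait, filter_emp),  # shuffle added here
        "delivers": trait}                    # join preserves the trait

# The aggregation groups on the same key; the merge join already
# delivers the required trait, so no second shuffle is inserted.
agg = {"op": "aggregate", "input": enforce(trait, join)}
print(agg["input"]["op"])
```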

[Figure]

Introduction to New Features of MaxCompute SQL

Compiler

[Figure]

Studio

1) Job monitoring: Studio provides a detailed, intuitive job flowchart, so users can easily view the task flow and the overall execution plan (upper-left subgraph).
2) Job analysis: Studio provides data for job analysis (lower-left subgraph), such as the execution time of each operator in every task. With this information, users can see at a glance which parts of the plan execute poorly, consume the most resources, or take the longest, and then optimize those parts in a targeted way to improve the efficiency of the execution plan.
3) Real-time SQL error prompts: Studio flags SQL errors in real time and can report all errors and warnings at once, pointing the user to their locations in the SQL (upper-right subgraph).
4) Intelligent code completion: Based on the letters already typed and the surrounding context, Studio predicts and suggests the commands the user may want to enter, completing the code automatically and conveniently (lower-right subgraph).


Script Mode

Script mode solution: For this requirement, MaxCompute 2.0 provides the Script Mode feature. We again use the earlier example: counting the average age of employees over 30 years old in departments created after 2018. In MaxCompute script mode, all commands can be written in a single SQL file. The user first defines a variable A that filters employees older than 30 from the employee table "emp", then a variable B that filters departments created after January 1, 2018 from the department table "dept", and finally a variable C that computes the average age of the over-30 employees in those departments. The results are then easy to work with in script mode; for example, multiple script commands can insert them directly into different tables (the specific commands are shown in the following figure). For complex queries, this approach makes both the query and the subsequent operations clearer and easier to maintain.

[Figure]

The script mode has the following three main advantages:

  • Low complexity: Queries are submitted in script units. It is suitable for submitting complex queries;
  • High usability: It conforms to the thinking process of developers, and is suitable for writing complex logic;
  • High performance: It gives full play to the optimizer's capabilities (since optimization is cost-based, the more complex the script, the greater the advantage).

Parameterized View

Parameterized view solution: To solve this problem, MaxCompute 2.0 provides the Parameterized View feature. The syntax is similar to creating an ordinary view, except that the parameterized view takes two parameters: a table parameter @a (with two columns, k and v) and a string parameter @b. The command for creating a parameterized view is shown in the following figure; the creation here also uses script mode. A view created this way can easily be reused by any user, because the usage pattern is the same and only the arguments differ.

[Figure]

Custom Features (UDT)

UDT solution: For the above scenario, MaxCompute 2.0 provides the UDT (User Defined Type) feature, which allows users to directly reference a class or object from a third-party language in SQL, access its data, or call its methods. The problem above can then be solved with a single SQL statement. This approach avoids repeatedly creating UDFs and is as convenient as making the call in Java.

[Figure]

External Table

External Table solution: The External Table feature in MaxCompute lets users access other data sources and data formats through SQL. The following figure shows an example of processing CSV files stored on OSS. First, define an external table: the definition specifies the schema for reading the CSV, MaxCompute's built-in CSV storage handler, and the OSS location of the data to be processed. Once the external table is defined, the data can be read and processed with SQL just like an internal table.

[Figure]

The External Table feature supports multiple data sources and formats, and supports user-defined data formats (as shown in the following figure).

[Figure]

Future Plans for MaxCompute

  • Better performance: In the TPC-H benchmark, performance more than doubled.
  • Better SQL standard support: Hive fails to support a number of the queries in the TPC-DS benchmark, while MaxCompute supports 100% of them.
  • Better compiler: It supports reporting all errors at once, efficient error recovery, and warnings.
  • A solid runtime foundation: The C++-based runtime delivers higher efficiency.
  • Better IDE support: MaxCompute Studio.
  • Better support for small jobs: With Service Mode, small jobs can return results in seconds or even milliseconds.
  • More useful syntax: This includes VALUES expressions, scalar subqueries, INTERSECT/MINUS, Python UDFs, UDT/UDJ, Script Mode, and Parameterized Views.

In subsequent versions, MaxCompute will mainly add two new features:

  • Memory computing: Data can be cached in the memory, and the data loaded once can be used many times, thus greatly reducing the running time.
  • More powerful scripting-language features: For example, conditional statements such as IF/ELSE and loop statements such as LOOP can be used in SQL.

My Personal Growth Experiences

Two years ago, I first came into contact with the open-source community. Speaking from personal experience, I have summarized two main points:

1) Growth is gradual. At the beginning, the requirements I took on were small, but through active communication with and learning from the community, they grew larger and larger, and my contributions gradually increased.
2) I learned a lot from the open-source community. It pays great attention to code quality and to communication between members, and timely communication avoids many unnecessary conflicts and much rework.
