MaxCompute Query Optimization with Calcite

Query Optimizer Introduction

  • Rule-Based Optimizer (RBO): It converts a relational expression into another relational expression based on the optimization rules while discarding the original expression. After a series of conversions, it generates the final execution plan. In fact, RBO is a laggard optimizer that only recognizes rules and is insensitive to data. It is easy to fall into the scenario of local superiority but global inferiority. It may also be affected by the order of the rules and thus generate different execution plans.
  • Cost-Based Optimizer (CBO): It converts the relational expression according to the optimization rules while preserving the original expression. After a series of conversions, it generates multiple execution plans, then computes the cost of each execution plan, and selects the execution plan with the lowest cost as the final execution plan. The advantage of CBO is its flexibility and intelligence. It can generate corresponding execution plans based on the characteristics of data. Therefore, at present, major databases and computing engines tend to use CBO. For example, starting from Oracle 10g, Oracle has completely abandoned RBO and used CBO instead. CBO is also introduced in Hive 0.14 version. MaxCompute has also gone through the process from RBO to CBO, which is also a major change in the process from MaxCompute 1.0 to 2.0.
  • System-R: After an SQL statement is processed by the parser and the resolver, the query is rewritten first, and then the final physical execution plan is generated by the optimizer through the dynamic planning algorithm, and finally the plan is submitted to the Executor for execution. This type of CBO was first used by IBM databases, and later used by Oracle and PostgreSQL.
  • Volcano/Cascade: The difference between Volcano/Cascade and System-R is that the physical optimizer of the former is Top-down, while the physical optimizer of the latter is Bottom-up. Volcano/Cascade generates a better physical execution plan based on a certain rule system (logical rules + physical rules). This style is currently used by SQL Server, Apache Calcite, and Pivotal.
  • HepPlanner: It refers to the implementation of RBO mentioned above. It is a heuristic optimizer that matches according to the rules until the limit (the preset match limit) is reached or the rule match is no longer present after the iteration;
  • VolcanoPlanner: It refers to the implementation of the CBO mentioned above. It will iterate the rules until the plan with the lowest cost is found.

MaxCompute Query Optimizer Practices

The overall framework of the MaxCompute SQL optimizer is shown in the following figure. It mainly includes five parts:

  • Cost model: It is used to compute the cost to select the optimal execution plan. A good cost model may affect the performance of the entire system. The cost model is a five-tuple, including CPU, I/O, Row Count, Memory, and Network. Each operator only focuses on its own cost. The cost of the entire execution plan is accumulated by the engine. The cost model strives to reflect objective physical implementations, such as specific I/O and Network values, but does not need to obtain exactly the same data as the actual one. It only needs to ensure that a better execution plan can be selected.
  • Optimization rules: Different systems have different optimization rules, such as partition pruning rules.
  • Statistics: This part is very important for the optimizer. Lack of or poor statistics may lead to poor performance of the generated execution plan.
  • Query optimization engine: implements lots of features based on Apache Calcite, including partition pruning, join ordering, and selection of join algorithms.
  • Optimization of resource allocation based on history: After the Executor executes the task, the execution result is fed back to the module for resource allocation optimization.

Introduction to New Features of MaxCompute SQL

MaxCompute handles 99% of computing and storage at Alibaba, and focuses on improving the user experience and computing efficiency of external users. Next, the new features of MaxCompute 2.0 are introduced, mainly including the compiler, the development tool Studio, the procedural support features (such as the Script Mode and the Parameterized View) to help improve development efficiency, and custom features (such as UDT and the External Table). As the public cloud has not been officially released, you can search for the DingTalk group number 11782920, and contact the group owner Jin Heng, for use.

Compiler

In MaxCompute 2.0, the compiler currently has three major new features: 1) All compilation errors include row or column numbers to help users clearly understand the location of errors; 2) Multiple errors in synonym compilation are reported in the form of an error report at one time to facilitate modification; 3) Warning is supported to prompt the user for potential problems, such as loss of Double = STRING precision. This feature needs to be enabled in advance by setting odps.compiler.warning.disable=false. The above three features are very practical in development, and will greatly improve the development efficiency of users.

Studio

The second one that needs to be strongly recommended is Maxcompute 2.0 Studio. As an IDE, users can use it to write scripts. The main benefits include the following (as shown in the following figure):

Script Mode

Problem scenarios: When faced with a project with complex processing logic, you need to extract multiple tables and then join these tables, and then join the results. In addition, multiple tables need to be output from different run phases, which cannot be expressed even if CTE (Common Table Exclusion) is used. In this case, these steps can only be split into multiple jobs and submitted in sequence. The complexity and maintenance cost of this solution are both high, and it also has a great impact on performance.

  • Low complexity: Queries are submitted in script units. It is suitable for submitting complex queries;
  • High usability: It conforms to the thinking process of developers, and is suitable for writing complex logic;
  • High performance: It gives full play to the optimizer capabilities (based on the cost, the more complex the script, the more advantageous it will be).

Parameterized View

Problem scenarios: Suppose a user creates a view for some teams. The function of the view is to read a data table, and run the newly written pattern recognition algorithm to provide advertisement recommendations. Other teams think that the algorithm is good and also want to use it, but the data table accessed at the underlying layer is different, and some parameters of pattern recognition are also different, so the user has to create a new view for them. However, it is later discovered that the original view has bugs, which can only be fixed one by one, resulting in significant complexity for maintenance.

Custom Features (UDT)

Problem scenarios: UDFs can be written in both Hive and MaxCompute to meet some requirements without built-in functions. For example, you need to implement a feature to parse JSON strings. This feature is very easy to implement in other languages. For example, you may only need to call the JSON function once in Java. However, this feature is not implemented for built-in MaxCompute functions, so writing a UDF is too complicated.

External Table

Problem scenarios: The data of MaxCompute users are stored on other storage systems (such as HDFS), and the format is CSV. Can SQL be used to operate?

Future Plans for MaxCompute

Before introducing the follow-up plan for MaxCompute, let’s first summarize its advantages over Hive:

  • In the TPCH benchmark test, the performance more than doubled.
  • Better SQL standard support: Hive does not support many SQL when running TPC-DS benchmark, while MaxCompute can achieve 100% support.
  • Better compiler: It supports one-time error reporting, efficient error recovery, and warning.
  • Good foundation for runtime: Based on the runtime implemented by CPP, higher efficiency can be achieved.
  • Better IDE support: MaxCompute Studio.
  • Better support for small jobs: Through Service Mode support, you can query small jobs in seconds or even milliseconds.
  • Support for more useful syntaxes: This includes VALUES expressions, SCALAR SUBQUERIES, INTERSECT/MINUS, PYTHON UDF, UDT/UDJ, the Script Mode, and the Parameterized View.
  • Memory computing: Data can be cached in the memory, and the data loaded once can be used many times, thus greatly reducing the running time.
  • More powerful scripting language features: For example, conditional statements, such as IF/ELSE and loop statements, such as LOOP, can be used in SQL.

My Personal Growth Experiences

Since I was recruited into Alibaba MaxCompute team in 2015, I have witnessed the development of MaxCompute from 1.0 to 2.0. Personally speaking, when I came to Alibaba after completing school, I felt like I had come to another “campus”. The team is based on the open-source community project Apache Calcite, and is also a heavy user. During the use process, the team has contributed some good ideas to the community and the community has actively accepted them. Therefore, it is a matter of course to become a committer.

Original Source

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

4.97K Followers

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com