MaxCompute SQL Join Development Guide

By Sixiang

Background

The results in this article are all derived using standard SQL semantics. We hope that you can write SQL statements according to the standard SQL semantics, to ensure the portability of subsequent SQL statements.

JOIN ON Condition in MaxCompute SQL

One of the most common operations in MaxCompute SQL is Join. Currently MaxCompute provides several Join types:

Image for post
Image for post

Depending on the scenario, the user can use different Join types to implement the corresponding Association operation. But during actual implementation, users aren’t always aware that the filtering criteria are different for JOIN ON and WHERE statements, or think they doing the same thing, for example, in a production environment, users can often be seen writing:

The user intends to get the data for a partition in A and B for the Join operation, that is:

However, for different Join types, the two may not be equivalent. It cannot implement partition condition pushdown, causing both full table scanning and correctness issues. Here, we will briefly analyze the similarities and differences between the filter conditions.

  1. WHERE condition of subquery
  2. JOIN ON condition
  3. WHERE condition after JOIN ON

1. Principle

First, we describe the computing order of JOIN and WHERE conditions, for the following code:

the computing order is

  1. {subquery_where_condition} in the subquery
  2. {on_condition} of JOIN
  3. The computing of the JOIN result set {where_condition}

For different Join types, filter clauses are placed in {subquery_where_condition},{on_condition}, and {where_condition}. Sometimes the results are consistent, and sometimes they are not. We discuss for different conditions:

2. Putting Everything to Practice

2.1 Preparation

First we create a table A:

keyds120190121220190121220190122

Then their Cartesian product is:

Image for post
Image for post

2.2. Inner Join

Conclusion: The filter conditions in {subquery_where_condition}, {on_condition}, and {where_condition} are equivalent.

The processing logic of Inner Join is to create a Cartesian product with the left and right tables, then output rows that meet the ON expression.

The first case is to filter data by using a subquery:

There is only one result.

a.keya.dsb.keyb.ds120190121120190121

The second case is to filter data by using the JOIN condition:

There are 9 results, and only one of which meets the ON condition.

a.keya.dsb.keyb.ds120190121120190121

The third case is to filter data by using the WHERE condition after JOIN.

Of the 9 Cartesian product results, three results meet the ON condition “a.key = b.key”:

Image for post
Image for post

Filter this result set by using the condition “A.ds=’20190121' and B.ds=’20190121'”. Only one result is returned, which is consistent with the previous result.

a.keya.dsb.keyb.ds120190121120190121

2.3 Left Join

Conclusion: The filtering conditions in {subquery_where_condition}, {on_condition}, and {where_condition} are not necessarily equivalent.

For the filtering conditions for the left table, the ones placed in {subquery_where_condition} and {where_condition} are equivalent.

For the filter conditions for the right table, the ones placed in {subquery_where_condition} and {on_condition} are equivalent.

The processing logic of Left Join is to make the left and right tables a Cartesian product, then output rows that meet the ON expression. For the rows in the left table that do not meet the ON expression, the left table is output, and the NULL is returned in the right table.

The first case is to filter data by using a subquery:

Image for post
Image for post

The second case is to filter data by using the JOIN condition:

Of the nine Cartesian product results, also only one result meets the ON condition. The two remaining entries in the left table are returned and NULL are returned in the right table.

Image for post
Image for post

The third case is to filter data by using the WHERE condition after JOIN.

Of the 9 Cartesian product results, three results meet the ON condition “a.key = b.key”:

Image for post
Image for post

Filter this result set by using the condition “A.ds=’20190121' and B.ds=’20190121'”. Only one result is returned.

a.keya.dsb.keyb.ds120190121120190121

As you can see, three different results have been obtained by placing the filter conditions in three different places.

2.4 Right Join

The Right Join and Left Join are similar, with the difference being between the left and right tables.

Conclusion: The filtering conditions in {subquery_where_condition}, {on_condition}, and {where_condition} are not necessarily equivalent.

For the filter conditions for the right table, the ones placed in {subquery_where_condition} and {where_condition} are equivalent.

For the filter conditions for the left table, the ones placed in {subquery_where_condition} and {on_condition} are equivalent.

2.5 Full Join

Conclusion: Putting filter conditions in {subquery_where_condition}, {on_condition}, and {where_condition} are non-equivalent.

The processing logic of Full Join is to make the left and right tables a Cartesian product, then output rows that meet the ON expression. For the rows on both tables that do not meet the ON expression, the table with data is output, and NULL is returned in the other table.

The first case is to filter data by using a subquery:

After filtering, there are two on the left and two on the right, and three in the results:

Image for post
Image for post

The second case is to filter data by using the JOIN condition:

Of the nine Cartesian product results, also only one result meets the ON condition. The two remaining entries in the left table are returned and NULL are returned in the right table. The two remaining entries in the right table are returned and NULL values are returned in the left table.

Image for post
Image for post

The third case is to filter data by using the WHERE condition after JOIN.

Of the nine Cartesian product results, three results meet the ON condition “a.key = b.key”:

Image for post
Image for post

Rows in one table that have no matches in the other table are returned, and NULL is put for the rows in the table where matches are not found to obtain a result set:

Image for post
Image for post

Filter this result set by using the condition “A.ds=’20190121' and B.ds=’20190121'”. Only one result is returned.

a.keya.dsb.keyb.ds120190121120190121

We can see that, similar to LEFT JOIN, three different results are obtained.

2.6 Left Semi Join

Conclusion: Putting a filtering condition in {subquery_where_condition}, {on_condition}, and {where_condition} are equivalent.

The processing logic of LEFT SEMI JOIN is that it tries to find a match in the right table for each entry in the left table and returns entries in the left table for which matching entries are found in the right table. Note that filtering conditions cannot be written in the WHERE condition for the right table after JOIN because LEFT SEMI JOIN only returns entries from the left table. LEFT SEMI JOIN is often used to implement the EXISTS semantics.

The first case is to filter data by using a subquery:

After filtering, two matching entries are found respectively in the right table and left table, but only one entry matches the “a.key = b.key” condition.

a.keya.ds120190121

The second case is to filter data by using the JOIN condition:

Of the three records in the left table, also only one meets the ON condition.

a.keya.ds120190121

The third case is to filter data by using the WHERE condition after JOIN.

Only one entry in the left table meets the ON condition.

a.keya.ds120190121

Use “A.ds=’20190121'” to filter the results. Again the result set only has one matching entry.

a.keya.ds120190121

We can see that, like INNER JOIN, LEFT SEMI JOIN always obtains the same result, regardless of the position of the filtering condition.

2.7 Left Anti Join

Conclusion: Putting a filtering condition in {subquery_where_condition}, {on_condition}, and {where_condition} respectively are not necessarily equivalent.

For the filtering conditions for the left table, the ones placed in {subquery_where_condition} and {where_condition} are equivalent.

It is also equivalent to put the filtering condition for the right table in {subquery_where_condition} and {on_condition}. However, the expression for the right table cannot be put in {where_condition}.

The processing logic of LEFT ANTI JOIN is that it tries to find a match in the right table for each entry in the left table and returns entries in the left table if none of the entries in the right table are matched. Similarly, filtering conditions cannot be written in the Where condition for the right table after JOIN because LEFT ANTI JOIN only returns entries from the left table. LEFT SEMI JOIN is often used to implement the NOT EXISTS semantics.

The first case is to filter data by using a subquery:

After filtering, two entries are returned from the right table and two from the left table. The final result set only contains one entry.

a.keya.ds220190121

The second case is to filter data by using the JOIN condition:

Of the three entries from the left table, only one meets the ON condition, so the other two entries are returned.

a.keya.ds220190121220190122

The third case is to filter data by using the WHERE condition after JOIN.

Two entries in the left table meet the ON condition.

a.keya.ds220190121220190122

Then use “A.ds=’20190121'” to filter the results. Only one entry meets the this condition.

a.keya.ds220190121

We can see that, for the LEFT ANTI JOIN operation, putting the same filtering condition in the JOIN ON condition and the WHERE condition respectively will return different results.

The preceding examples are only a test of writing filtering conditions in different places in the same application scenario and do not include the specific deduction process. This is even more complicated in scenarios where inequivalent expressions are involved. If you are interested, you can start deducing for yourself.

3. Summary

Putting the same filtering condition in different places may lead to drastically different semantics. If you want to perform JOIN operations only after data is filtered, simply remember the following considerations. If you still consider these rules too complex, the best way is to always write filtering conditions into subqueries, although this may look a little wordy.

  1. For INNER JOIN/LEFT SEMI JOIN, you can write filtering expressions anywhere.
  2. For LEFT JOIN/LEFT ANTI JOIN, you need to put a filtering condition for the left table in {subquery_where_condition} or {where_condition} and a filtering condition for the right table in {subquery_where_condition} or {on_condition}.
  3. RIGHT JOIN is opposite to LEFT JOIN: You need to put a filtering condition for the right table in {subquery_where_condition} or {where_condition} and a filtering condition for the left table in {subquery_where_condition} or {on_condition}.
  4. For FULL OUTER JOIN, filtering conditions can only be put in {subquery_where_condition}.

Reference:https://www.alibabacloud.com/blog/maxcompute-sql-join-development-guide_594737?spm=a2c41.12825681.0.0

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store