The Secrets Behind the Optimized SQL Performance of EMR Spark

Preface

Determination to Reach the Top of the TPC-DS-Perf Rankings for the Third Time

Comparing Open-Source Spark and EMR Spark

Nearly 300% Performance Improvement in the Load Phase

Nearly 600% Performance Improvement in the PT Phase

The queries that took Spark Community Edition 2.4.3 more than 200 seconds to execute were singled out for comparison with the corresponding queries executed by EMR Spark.

Optimizations

Optimizers

Common table expression (CTE) materialization based on InMemoryTable Cache
Dynamic partition pruning:
Small table broadcast reuse:
Bloom filter before SMJ:
Primary key (PK) and foreign key (FK) constraints optimization:
RI-Join removal:
Removal of non-PK columns from GroupBy keys:
GroupBy Push Down before Join
Fast Decimal
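
To make the "Bloom filter before SMJ" item in the list above more concrete, here is a minimal pure-Python sketch of the general idea: build a Bloom filter from the join keys of the small side, drop rows of the large side that cannot match, and only then run the sort-merge join (SMJ). This is not EMR Spark's implementation. The names `BloomFilter`, `sort_merge_join`, and `bloom_filtered_join`, as well as the bit-array size and hash count, are assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one stable hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def sort_merge_join(left, right):
    """Inner sort-merge join of two lists of (key, value) rows."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def bloom_filtered_join(big, small):
    """Pre-filter the big side with a Bloom filter built from the small side's keys."""
    bf = BloomFilter()
    for key, _ in small:
        bf.add(key)
    # Rows whose key is definitely absent from the small side are dropped
    # before the expensive sort; false positives are removed by the join itself.
    survivors = [row for row in big if bf.might_contain(row[0])]
    return sort_merge_join(survivors, small), len(survivors)
```

Calling `bloom_filtered_join(big, small)` returns the same rows as `sort_merge_join(big, small)`, together with the number of large-side rows that survived the filter. The benefit comes from sorting (and, in a distributed engine, shuffling) far fewer rows when the small side is selective.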

Runtime

Summary

Alibaba Cloud