Alibaba Cloud E-MapReduce Sets World Record Again on TPC-DS Benchmark

EMR Breaks World Record Again

TPC-DS is the first benchmark for SQL-based big data systems. It has been published for eight years, but only three companies in the world (including Alibaba Cloud) have managed to pass the TPC-DS benchmark for their database software products. TPC-DS is well known for its SQL complexity, process completeness, and enormous data volume, making it the world’s most challenging big data benchmark.

What Is EMR?

Alibaba Cloud E-MapReduce (EMR) is a big data processing solution that runs on the Alibaba Cloud platform. EMR is built on Alibaba Cloud Elastic Compute Service (ECS) instances and is based on open-source Apache Hadoop and Apache Spark. EMR allows you to use the Hadoop and Spark ecosystem components, such as Apache Hive, Apache Kafka, and Apache HBase, to analyze and process data. You can use EMR to process data stored in different Alibaba Cloud data storage services, such as Alibaba Cloud Object Storage Service (OSS) and Alibaba Cloud Log Service (SLS). Currently, EMR has been used in enterprises and institutions in various sectors such as new retail, Internet, education, artificial intelligence, and government affairs. Take Yeahmobi, a well-known international marketing service company, as an example. Yeahmobi has built a big data computing platform based on Alibaba Cloud EMR, to implement unified storage and analysis, and reducing the overall cost by more than 30%.

Powerful Engine: Jindo Spark

Jindo Spark is a cloud-native distributed computing and storage engine independently developed by the Alibaba Cloud E-MapReduce Team based on the open-source Apache Spark project. It has been deployed and used by nearly one thousand clients. Jindo Spark has been greatly optimized and extended based on Apache Spark. It deeply integrates and connects to many basic Alibaba Cloud services.

1) Native Runtime Computing Engine

Jindo Spark has upgraded the whole stage code generation framework (the core of open-source Spark SQL projects) to the native code generation framework. To improve the execution efficiency of the generated code, Jindo Spark integrates the Weld-IR technology. Jindo Spark also supports speculative compilation and global code caching. The native runtime computing engine of Jindo Spark has analyzed frequently used SQL operators, and optimized some of them to improve the performance. For example, it has optimized the SortMergeJoin and PartitionBy operators to improve the performance during the time-consuming shuffle stage.

2) Significant Upgrade of the Data Lake Solution

Alibaba Cloud EMR and OSS provide an all-in-one data lake solution that separates computing and storage. The latest JindoFS, a cloud-native file system, supports both Cache and Block modes, as well as the full range of EMR computing engines and HBase databases.

3) A Variety of Real-Time Data Streams

Jindo Spark supports:

  1. Spark Streaming SQL and real-time extract-transform-load (ETL), which makes it easier for development and use.
  2. Synchronizing MySQL Binlog CDC data to the data lake in real time. You can use templates to quickly build data streams and query real-time data in the data lake (with engines such as Spark SQL, Presto, and Hive).

4) Spark Cube

Spark Cube can persist relational data in any table or view to data storage, and implement capabilities similar to the materialized views and cubes in traditional data warehouses. Cache data storage supports a variety of data sources and formats, and many data organization methods, such as partitioning, bucketing, sorting, and file indexing. Jindo Spark automatically selects the appropriate cache to rewrite the execution plans for user queries to speed up query execution. By pre-organizing and pre-computing data, Spark Cube can respond to interactive analysis requests that involve ultra-large amounts of data in sub-seconds, and is suitable for scenarios such as multi-dimensional analysis, BI reports, and dashboards.

Looking Ahead

Alibaba Cloud VP Yangqing Jia said that “While actively embracing open-source technologies, Alibaba Cloud is also constantly developing and innovating new technologies. That is exactly why Alibaba Cloud managed to set world records for two consecutive years. Alibaba Cloud is eager to use these innovative technologies to benefit more enterprises.”

Original Source:



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store