Alibaba Cloud E-MapReduce Sets World Record Again on TPC-DS Benchmark

Image for post
Image for post

According to TPC-DS — Top Ten Performance Results as of April 26, 2020, Alibaba Cloud won first place as the world’s only cloud computing company on the list. It is worth noting that, in the last year, Alibaba Cloud E-MapReduce (EMR) broke world records on the 10,000 GB (10 TB) and 100,000 GB (100 TB) TPC-DS benchmarks for the first time, making it the world’s first public cloud product that won this honor. This year, EMR has increased its computing speed and is now 2.2 times as fast as it was in the last year, delivering an amazing result of 11,569,838 QphDS. It is the first product that achieves a higher score than 10,000,000. EMR is 3.5 times as fast as the best commercial big data processing product of a competitor. EMR continues to maintain its competitive advantage in processing large amounts of data. It is efficient in processing 100 TB data, which is 10 times the maximum data processing capacity of competitor products.

Image for post
Image for post

To view the complete results, visit http://www.tpc.org/tpcds/results/tpcds_perf_results5.asp?resulttype=all

EMR Breaks World Record Again

TPC-DS is the first benchmark for SQL-based big data systems. It has been published for eight years, but only three companies in the world (including Alibaba Cloud) have managed to pass the TPC-DS benchmark for their database software products. TPC-DS is well known for its SQL complexity, process completeness, and enormous data volume, making it the world’s most challenging big data benchmark.

Take the data volume as an example. Alibaba Cloud E-MapReduce (EMR) used a 10 TB dataset for this test, comprising simulated data of 1.3 billion products, 50 billion transactions, and 60 million users. Most of the queries must be executed simultaneously on this dataset to return the results within a few seconds or one minute.

What Is EMR?

Alibaba Cloud E-MapReduce (EMR) is a big data processing solution that runs on the Alibaba Cloud platform. EMR is built on Alibaba Cloud Elastic Compute Service (ECS) instances and is based on open-source Apache Hadoop and Apache Spark. EMR allows you to use the Hadoop and Spark ecosystem components, such as Apache Hive, Apache Kafka, and Apache HBase, to analyze and process data. You can use EMR to process data stored in different Alibaba Cloud data storage services, such as Alibaba Cloud Object Storage Service (OSS) and Alibaba Cloud Log Service (SLS). Currently, EMR has been used in enterprises and institutions in various sectors such as new retail, Internet, education, artificial intelligence, and government affairs. Take Yeahmobi, a well-known international marketing service company, as an example. Yeahmobi has built a big data computing platform based on Alibaba Cloud EMR, to implement unified storage and analysis, and reducing the overall cost by more than 30%.

Image for post
Image for post

Powerful Engine: Jindo Spark

Jindo Spark is a cloud-native distributed computing and storage engine independently developed by the Alibaba Cloud E-MapReduce Team based on the open-source Apache Spark project. It has been deployed and used by nearly one thousand clients. Jindo Spark has been greatly optimized and extended based on Apache Spark. It deeply integrates and connects to many basic Alibaba Cloud services.

Compared with Apache Spark, Jindo Spark has significantly improved functionality and performance while also maintaining interface compatibility. For example, in this TPC-DS 10 TB benchmark, Jindo Spark successfully runs all test procedures. However, Apache Spark is unable to complete some procedures such as data update, and cannot return results normally for some queries (2 out of 99 SQL queries). Jindo Spark also shows a great advantage in query performance. The time Apache Spark takes to run the rest (97 out of 99) queries is 6.1 times as long as that required by Jindo Spark. For some queries such as Query 67 and Query 78, Jindo Spark is more than 100 times faster than Apache Spark. In EMR 4.0, Jindo Spark has the following highlights in performance and functionality:

Jindo Spark has upgraded the whole stage code generation framework (the core of open-source Spark SQL projects) to the native code generation framework. To improve the execution efficiency of the generated code, Jindo Spark integrates the Weld-IR technology. Jindo Spark also supports speculative compilation and global code caching. The native runtime computing engine of Jindo Spark has analyzed frequently used SQL operators, and optimized some of them to improve the performance. For example, it has optimized the SortMergeJoin and PartitionBy operators to improve the performance during the time-consuming shuffle stage.

In addition, Jindo Spark has improved Spark SQL Catalyst Optimizer to implement optimization based on common table expressions (CTE), primary keys (PK), and foreign keys (FK). It also supports dynamic runtime filters, which improve the performance of some TPC-DS SQL queries by dozens of times.

Alibaba Cloud EMR and OSS provide an all-in-one data lake solution that separates computing and storage. The latest JindoFS, a cloud-native file system, supports both Cache and Block modes, as well as the full range of EMR computing engines and HBase databases.

In the Block mode, JindoFS, combines the benefits of both local storage that features high-performance, and OSS storage that features large data volume, reliability, and low-cost. In other words, JindoFS boasts about a performance close to that of local storage, as well as the large storage capacity, elasticity, and cost-efficiency as that of OSS storage. JindoFS supports hot and cold data migration between local storage and OSS storage. You do not need to explicitly migrate or maintain metadata locations or perform explicit mounting. In the Cache mode, JindoFS supports the existing access methods of OSS while allowing users to cache and accelerate the metadata and file data as required. In either mode, JindoFS supports all kinds of EMR computing engines, including MapReduce, Spark, Hive, Flink, Impala, Presto, Kafka, and even HBase. JindoFS supports access from external non-EMR clusters, which streamlines the whole process.

Jindo Spark supports:

  1. Clusters that use different storage engines, such as Kafka, Kudu, Druid, and HBase, and deep integration with them to meet the requirements of various business scenarios that use a real-time data lake.
  2. Spark Streaming SQL and real-time extract-transform-load (ETL), which makes it easier for development and use.
  3. Synchronizing MySQL Binlog CDC data to the data lake in real time. You can use templates to quickly build data streams and query real-time data in the data lake (with engines such as Spark SQL, Presto, and Hive).

Spark Cube can persist relational data in any table or view to data storage, and implement capabilities similar to the materialized views and cubes in traditional data warehouses. Cache data storage supports a variety of data sources and formats, and many data organization methods, such as partitioning, bucketing, sorting, and file indexing. Jindo Spark automatically selects the appropriate cache to rewrite the execution plans for user queries to speed up query execution. By pre-organizing and pre-computing data, Spark Cube can respond to interactive analysis requests that involve ultra-large amounts of data in sub-seconds, and is suitable for scenarios such as multi-dimensional analysis, BI reports, and dashboards.

Spark Cube is now open source at https://github.com/alibaba/SparkCube. You are encouraged to use it and provide suggestions.

Looking Ahead

Alibaba Cloud VP Yangqing Jia said that “While actively embracing open-source technologies, Alibaba Cloud is also constantly developing and innovating new technologies. That is exactly why Alibaba Cloud managed to set world records for two consecutive years. Alibaba Cloud is eager to use these innovative technologies to benefit more enterprises.”

In the future, the EMR team will strive to further improve Jindo in terms of performance, functionality, and scalability. We will challenge even larger datasets and meet more extensive client requirements on processing big data on the cloud, developing EMR into a flagship product of Alibaba Cloud.

Image for post
Image for post

Scan the QR code to join Apache Spark community!
Let’s explore more open source big data platform solutions together!

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store