AnalyticDB: Alibaba Cloud’s Cloud-native Data Warehouse Broke the TPC-DS World Record
On May 4, 2020, the Transaction Processing Performance Council (TPC), the world's best-known non-profit evaluator of data management systems, officially published the TPC-DS results for AnalyticDB on its website. AnalyticDB passed the rigorous TPC-DS tests with a Query-per-Hour Performance (QphDS) of 14,895,566 at a unit price of CNY 0.08, as shown in the following figure. AnalyticDB performed 29% better than the previous world record, which was set by a deeply optimized Spark deployment, while cutting the unit price by two-thirds.
These results make AnalyticDB the world’s leading data warehouse in terms of performance and cost-effectiveness on the TPC-DS benchmark, breaking its own record set on April 26, 2019. For more information, see TPC-DS Results.
Enterprises’ data needs are constantly changing, evolving from traditional big data to fast data. This is mainly manifested in the following four areas (some data is sourced from Gartner and IDC):
- Data volumes are growing explosively. The global data volume is expected to reach 40 ZB by 2020 and grow more than fourfold by 2025.
- Enterprises are migrating to the cloud at a significantly faster rate. By 2025, 50% of enterprise data will be stored in the cloud, while 75% of enterprise databases will run in the cloud.
- There is strong demand for real-time data processing. By 2025, 30% of global data will be processed in real time.
- Data is becoming intelligence-oriented. With the development of artificial intelligence (AI) and 5G technologies, unstructured data is growing rapidly. By 2025, 80% of data will be unstructured.
With the explosive data growth and full migration of enterprise data to the cloud, the key technical metrics of cloud-native data warehouses will be the storage and processing performance and cost-effectiveness of massive data. The TPC-DS test is a rigorous assessment and audit of a data warehouse in terms of:
- Data import
- Query performance (single query and multiple concurrent queries)
- Query complexity (covering star and snowflake schemas, and support for complex window functions)
- Availability (data consistency and fault tolerance for damaged disks)
TPC-DS is a core benchmark for measuring the maturity and competitiveness of a data warehouse.
Running AnalyticDB through the TPC-DS test was an important exercise for improving Alibaba's product development capabilities and verifying its core technical breakthroughs. Those breakthroughs are helping our customers improve the performance of real-time workloads and greatly reduce costs, getting them ready for a new era of database and big data integration and online business.
AnalyticDB is Alibaba Cloud's only proprietary PB-level real-time data warehouse that has been verified at scale in massive core production businesses. Since its launch in 2012, nearly a hundred versions have been released, supporting many online analytics businesses such as e-commerce, advertising, logistics, entertainment, tourism, and risk control. Alibaba Cloud began offering AnalyticDB to external enterprises in 2014, initially serving traditional large and medium-sized enterprises, government institutions, and numerous internet companies across more than ten industries.
AnalyticDB for MySQL 3.0 is a cloud-native data warehouse, which has evolved based on advances in database and big data integration and engineering over the past eight years. In the TPC-DS test, AnalyticDB for MySQL 3.0 fully demonstrated its outstanding cloud-native technology advantages, proving that it is nearly 10 times better than its closest competition.
2) TPC-DS Performance Benchmarks
TPC is the world’s most famous non-profit data management system evaluator. It formulates the standards, performance, and price metrics for business application benchmarks and publishes the test results. The benchmark results are a core metric for measuring the performance and cost-effectiveness of a data management system.
The TPC-DS test simulates a typical decision support system for retail data warehouses. It is the most challenging database benchmark and is an upgraded version of TPC-H. It uses the star and snowflake schemas, and the test set includes complex applications such as big data set statistics, report generation, online query, and data mining, making it similar to a real scenario.
The following factors make TPC-DS challenging:
- The data set scale is large. For example, the store_sales table contains more than 28 billion rows.
- It is oriented to real retail decision-making scenarios, and its SQL statements are very complex. It covers the core parts of SQL-99 and SQL:2003 as well as online analytical processing (OLAP) standards. It includes low-latency ad hoc report queries as well as high-throughput data mining and analytical queries.
- It involves many multi-dimensional test items, which require high performance, high reliability, high availability, and high cost-effectiveness, as well as extract, transform, load (ETL) and atomicity, consistency, isolation, durability (ACID) capabilities.
The following figure displays the TPC-DS test procedure and data model:
3) Technical Architecture of AnalyticDB for MySQL 3.0
AnalyticDB for MySQL 3.0 uses a cloud-native architecture that separates computing and storage and also separates cold and hot data. It supports high-throughput real-time writing and strong data consistency, ensuring high performance with hybrid loads of high-concurrency queries, and high-throughput batch processing.
The first layer is the access layer, which consists of Multi-Master scalable coordinators. The coordinators implement protocol access, SQL parsing and optimization, real-time data writing to shards, data scheduling, and query scheduling.
The second layer is the computing engine, which supports the integrated execution of distributed massively parallel processing (MPP) and directed acyclic graphs (DAGs). By using an intelligent optimizer, it supports high-concurrency and complex SQL statements. Leveraging the cloud-native infrastructure, compute nodes implement elastic scheduling and can be scaled out in minutes or even seconds based on business needs, which ensures effective resource utilization.
The third layer is the Raft-based distributed, real-time, and high-availability storage engine. It implements parallelization based on data shards and multiple Raft groups, separates cold and hot data to reduce costs through tiered storage, and achieves high performance through row-column-mixed storage and intelligent indexing.
4) AnalyticDB Storage Technology
4.1) Distributed and Highly Consistent Storage
AnalyticDB for MySQL 3.0 was independently researched and developed by Alibaba Cloud. Based on the Raft protocol, it provides a distributed, highly consistent, and highly reliable lightweight storage architecture for high-throughput real-time data writing. Therefore, it is suitable for scenarios that require very high analytics performance.
Compared with HBase and Kudu, AnalyticDB for MySQL 3.0 provides higher performance in SQL analytics. It also provides high consistency and visibility in real-time writing and supports ACID, which is not supported by open-source Elasticsearch and ClickHouse.
The following figure shows the overall storage architecture of AnalyticDB.
AnalyticDB for MySQL 3.0 adopts a database-oriented parallel data model, implemented as a multi-layer parallel architecture that combines the storage model with the MPP computing model.
- The first level is cluster instance-based parallelism. A user instance is divided into multiple storage node groups (worker groups), and each worker group consists of N (usually 3, or another odd number) worker nodes. A worker node is equivalent to a user data node container. Worker nodes are grouped to ensure stable communication during large-scale system expansion and to facilitate the parallel upgrade and O&M of the system.
- The second level is database-based parallelism. A user database is split into N shards. To ensure strong data consistency, each shard is an independent Raft group. Multiple shards form multiple Raft consensus groups. Shards can be Hash partitions or Range partitions. In general, data alignment can be performed in Hash partitions to avoid data shuffling when large data tables are joined. Shards can be balanced or migrated between different worker groups when necessary. Shards also support dynamic splitting and merging.
- The third level is table-based parallelism. Historical data storage in data warehouse scenarios involves data partitioning. For example, in TPC-DS, store_sales can be partitioned by time period. Data partitioning facilitates data lifecycle management and supports the query of partition pruning and dynamic file pruning (DFP). This helps greatly narrow down the data computing scope.
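The shard routing described in the second level can be sketched roughly as follows. This is a minimal illustration, not AnalyticDB's actual hashing scheme; the function name and shard count are hypothetical. The point is that two tables hash-distributed on the same join key place matching rows on the same shard, so the join runs locally without a shuffle.

```python
import zlib

NUM_SHARDS = 16  # hypothetical shard count for illustration

def shard_of(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a row to a shard by hashing its distribution key."""
    return zlib.crc32(key.encode()) % num_shards

# Orders and customers hash-distributed on the same key land on the
# same shard, so joining them requires no data shuffling.
orders_shard = shard_of("customer_42")
customers_shard = shard_of("customer_42")
assert orders_shard == customers_shard
```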
In the TPC-DS test, the distributed parallel storage architecture works closely with the query optimization and execution engine that perceives storage distribution, resulting in excellent overall performance.
4.2) High-performance Batch Import
Data import is the basic capability of a cloud data warehouse. Accordingly, TPC-DS requires high import performance.
The first optimization we made was to adopt a lightweight build, which converts real-time data into full partition data. AnalyticDB for MySQL 3.0 implements a lightweight local build with full memory and a single replica. Compared with the earlier MR-like full build, this greatly reduces the overhead of the Distributed File System (DFS) read and write and writing data to disks. It allows the system to make full use of the CPU to improve performance by localizing vector commands.
Then, we turned to I/O and network optimization. We used technologies such as DirectIO, binary encoding, full streaming, asynchronous processing, and zero-copy to significantly improve import performance.
Next, we reduced the data volume. We used the Raft 2+1 technique (two full data replicas plus one log-only replica) to reduce stored data by a third while preserving data reliability. We further compressed the data with the high-performance LZ4 compression algorithm, which greatly reduces read/write I/O and network transmission overhead.
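The one-third saving from Raft 2+1 is simple arithmetic, sketched below under the simplifying assumption that the log-only replica's footprint is negligible compared with a full data copy (the function and its parameters are illustrative, not AnalyticDB internals):

```python
def replica_storage(data_gb: float, data_replicas: int,
                    log_only_replicas: int, log_fraction: float = 0.0) -> float:
    """Approximate total storage: full data copies plus log-only copies.

    log_fraction models the (small) size of a log-only replica relative
    to the data; 0.0 ignores it for a back-of-envelope estimate.
    """
    return data_gb * data_replicas + data_gb * log_fraction * log_only_replicas

classic = replica_storage(100, data_replicas=3, log_only_replicas=0)  # 3 full replicas
raft_2p1 = replica_storage(100, data_replicas=2, log_only_replicas=1)  # 2 data + 1 log
assert classic == 300 and raft_2p1 == 200
# (300 - 200) / 300 == 1/3: the one-third reduction cited above.
```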
Finally, we were able to import 50 million records per second on 18 nodes in TPC-DS.
4.3) High-throughput Real-time Data Update in DML
AnalyticDB for MySQL 3.0 provides high-throughput real-time data update capabilities based on Raft and achieves excellent performance by using full asynchronization, zero-copy, and high-efficiency encoding and compression during writing. In the TPC-DS data manipulation language (DML) test, AnalyticDB can sustain tens of millions of writes per second (TPS) in real time while ensuring linear consistency (data is queryable immediately after it is written). In actual production environments, write performance is fully scalable, easily reaching 100 million TPS.
In TPC-DS, the data modification and ACID capabilities of the data warehouse need to be verified. AnalyticDB for MySQL 3.0 supports ETL transactions and ACID capabilities (supporting a complete TPC-C test). In the TPC-DS DML test, the Multiversion Concurrency Control (MVCC) capability of the storage engine assumes a major role. The storage engine is divided into real-time data (Delta), partition data (Main), and asynchronous data conversion (Build), creating an LSM-like write optimization architecture. AnalyticDB provides block-level MVCC and snapshot isolation to ensure the isolation (visibility) of data during ETL and data update and the atomicity of data update upon disk failures.
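The Delta/Main split with block-level MVCC can be illustrated with a toy model. This is a deliberately simplified sketch, not AnalyticDB's storage engine: writes append versioned records to the delta, a reader pins a snapshot version and ignores newer delta records, and an asynchronous build folds the delta into the main partition.

```python
from dataclasses import dataclass, field

@dataclass
class DeltaMainStore:
    """Toy LSM-like store: versioned delta rows plus immutable main rows."""
    main: dict = field(default_factory=dict)    # key -> value (partition data)
    delta: list = field(default_factory=list)   # (version, key, value) appends
    version: int = 0

    def write(self, key, value) -> int:
        """Append a new version to the delta; return its version number."""
        self.version += 1
        self.delta.append((self.version, key, value))
        return self.version

    def snapshot_read(self, key, snapshot: int):
        """Snapshot isolation: delta rows newer than the snapshot are invisible."""
        value = self.main.get(key)
        for v, k, val in self.delta:
            if k == key and v <= snapshot:
                value = val
        return value

    def build(self):
        """Asynchronous build step: fold the delta into the main partition."""
        for _, k, val in sorted(self.delta):
            self.main[k] = val
        self.delta.clear()

store = DeltaMainStore()
s1 = store.write("sku", 10)            # visible to snapshots >= 1
store.write("sku", 20)                 # newer version, invisible to s1
assert store.snapshot_read("sku", s1) == 10
assert store.snapshot_read("sku", store.version) == 20
```

A reader holding snapshot `s1` keeps seeing the old value even while new versions arrive, which is exactly the isolation property the ETL and update paths rely on.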
4.4) Row-column-mixed Storage and Intelligent Indexing
The proprietary row-column-mixed storage format enables AnalyticDB for MySQL 3.0 to combine a high filtering rate with high-throughput scanning. It offers better detail (point) lookups than the column-only format of open-source ORCFile, higher random-read performance than Parquet, and lower cost than models that keep fully redundant row-oriented and column-oriented copies. In AnalyticDB for MySQL, each table is stored in row-column-mixed files: data is divided into row groups, and within a row group data is stored in per-column blocks. Within a block, fixed-length or variable-length (toast) data is encoded and compressed and can be read either randomly or sequentially.
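A minimal sketch of the layout, assuming nothing about AnalyticDB's on-disk encoding (rows are plain dicts and the group size is arbitrary): rows are chunked into row groups, and inside each group every column becomes a contiguous block, so sequential column scans and point row reads are both cheap.

```python
def to_row_groups(rows, group_size):
    """Split rows into row groups; within a group, store per-column blocks."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        # Column blocks: one contiguous list per column inside the group.
        blocks = {col: [r[col] for r in chunk] for col in chunk[0]}
        groups.append(blocks)
    return groups

rows = [{"id": i, "price": i * 10} for i in range(6)]
groups = to_row_groups(rows, group_size=4)
assert len(groups) == 2                       # two row groups (4 + 2 rows)
assert groups[0]["price"] == [0, 10, 20, 30]  # column block: sequential scan
assert [groups[0]["id"][3], groups[0]["price"][3]] == [3, 30]  # row lookup
```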
In the TPC-DS test, we significantly improved storage scanning performance by properly configuring the storage block size (4 KB), prefetching data blocks, and reading vectors in the source operator. Accurate per-block statistics (such as min, max, sum, and count) accelerate Smart Scan and also feed the query optimizer, allowing it to formulate an optimal execution plan.
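How min/max block statistics accelerate a scan can be sketched as follows. This is an illustrative model only: in a real engine the statistics are precomputed and stored with each block rather than recomputed at scan time.

```python
def block_stats(values):
    """Per-block statistics, normally precomputed at write time."""
    return {"min": min(values), "max": max(values)}

def smart_scan(blocks, predicate_min, predicate_max):
    """Skip blocks whose [min, max] range cannot satisfy the predicate."""
    hits = []
    for values in blocks:
        s = block_stats(values)
        if s["max"] < predicate_min or s["min"] > predicate_max:
            continue  # whole block pruned without decompressing it
        hits.extend(v for v in values if predicate_min <= v <= predicate_max)
    return hits

blocks = [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
# Predicate 10 <= v <= 15 touches only the middle block.
assert smart_scan(blocks, 10, 15) == [10, 11, 12]
```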
AnalyticDB for MySQL provides a proprietary intelligent indexing framework that supports five types of indexes: inverted indexes (string), bitmap indexes, KD-tree indexes (digit), JSON indexes, and vector indexes. Different types of indexes support combinations of multiple conditions (intersect, union, and except) at the column level. Unlike a traditional data framework, this framework does not require the creation of a compound index (which avoids space expansion) and supports Index Condition Pushdown (ICP) for more conditions such as OR and NOT. For greater ease of use, you can enable automatic full-column indexing when creating a table in AnalyticDB. ICP is dynamically selected through Cost-Based Optimization (CBO) during query, and the ICP index chains will be merged and output at the predicate calculation layer in a streaming and progressive manner.
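The column-level combination of conditions (intersect, union, and except) works on posting sets of row IDs, which a tiny inverted-index sketch makes concrete. This is a generic textbook model, not AnalyticDB's index implementation:

```python
def inverted_index(column):
    """Map each value to the set of row IDs that contain it."""
    index = {}
    for row_id, value in enumerate(column):
        index.setdefault(value, set()).add(row_id)
    return index

city = ["HZ", "SH", "HZ", "BJ"]
status = ["paid", "paid", "open", "paid"]
city_idx = inverted_index(city)
status_idx = inverted_index(status)

# WHERE city = 'HZ' AND status = 'paid'  -> intersect posting sets
assert city_idx["HZ"] & status_idx["paid"] == {0}
# WHERE city = 'HZ' OR status = 'paid'   -> union
assert city_idx["HZ"] | status_idx["paid"] == {0, 1, 2, 3}
# WHERE city = 'HZ' AND NOT status = 'paid' -> except
assert city_idx["HZ"] - status_idx["paid"] == {2}
```

Because each column has its own index, multi-column predicates never need a precomputed compound index; the per-column posting sets are simply combined at query time.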
5) AnalyticDB Query Technologies
The query engine of AnalyticDB for MySQL 3.0 consists of the proprietary query optimizer and query executor. This is an important component that enables AnalyticDB for MySQL to provide high-concurrency and high-throughput data warehouse analytics. Its core advantages over a single computing engine include its awareness of data characteristics, in-depth integration with the storage engine architecture, and support for data warehouse analytics scenarios such as reporting, ad hoc, and ETL.
AnalyticDB for MySQL is a distributed, cloud-native, real-time data warehouse, so its optimizer must deal both with the classic problems of a traditional optimizer, such as the NP-hard nature of complex join reordering and the uncertainty of cost estimates, and with the new problems that distributed parallel plans introduce. CBO, the latest achievement of AnalyticDB for MySQL 3.0, was used for the first time in the TPC-DS test and contributed greatly to overall plan optimization.
Built on unified memory pooling and hybrid query load management, the query execution engine of AnalyticDB for MySQL uses technologies such as dynamic code generation, an innovative hybrid execution model, vectorized algorithms based on the single instruction, multiple data (SIMD) instruction set, and adaptive query execution over row-column-mixed storage. This ensures superior query performance for AnalyticDB for MySQL in TPC-DS.
5.1) CBO Query Optimization Framework
Cost-based optimization is essentially a complex search problem. To solve it, you need to address the following factors:
- Search Framework: From the perspective of database development, the Cascades-based search framework has become the industry standard. The commercial database SQL Server and the open-source GP/ORCA optimizer are both built on Cascades, and we implemented the CBO of the AnalyticDB for MySQL optimizer on Cascades as well. The core problem for the search framework is that the search space expands rapidly while the search response time must stay within milliseconds. Therefore, you need efficient optimization rules to generate the search space, an efficient data structure to store it, an efficient search algorithm to traverse it, and an efficient pruning policy to crop it.
- Distributed Parallel Plan: Distributed MPP databases pose new challenges for the optimizer. In a distributed MPP database, data distribution directly affects the correctness of query results. To meet the distribution requirements of different operators, data redistribution is inevitable, but the cost of redistribution (data shuffling) is very high. Therefore, you need to reduce data shuffling as much as possible while still ensuring correctness. The optimizer of a distributed MPP database must track data partitioning, sorting, and grouping properties in the search space and select the optimal distributed parallel execution plan based on cost.
- Cost Estimation: Cost estimation is a key factor in finding the optimal plan. It involves the derivation of statistical information and a cost model. The derivation of statistical information depends on the statistics of the original tables, derivation algorithms for intermediate operators, and various assumptions about the data (uniformity, independence, and inclusion assumptions), with guesses made in extreme cases. The derivation is therefore full of uncertainty, which makes finding the optimal plan much harder. Essentially, estimating statistics correctly requires breaking these assumptions about data properties, but breaking them increases the cost of collecting and maintaining statistics.
- Statistics Collection: The necessary statistics must be collected before the CBO runs. Basic statistics are collected and updated automatically, while advanced statistics can be collected manually, providing a reliable, multidimensional statistical basis for the CBO. However, some statistics may be missing or stale. To prevent an inappropriate plan, the system can dynamically sample data at runtime to obtain the statistics it needs.
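The independence and inclusion assumptions mentioned above correspond to standard textbook estimation formulas, sketched below. These are the generic formulas, not AnalyticDB's actual cost model; the numbers are illustrative.

```python
def estimate_rows(table_rows, selectivities):
    """Independence assumption: combined selectivity is the product."""
    est = float(table_rows)
    for s in selectivities:
        est *= s
    return est

def join_rows(r_rows, s_rows, ndv_r, ndv_s):
    """Textbook equi-join estimate under the inclusion assumption:
    |R join S| ~= |R| * |S| / max(NDV(R.key), NDV(S.key))."""
    return r_rows * s_rows / max(ndv_r, ndv_s)

# Two predicates that each keep 10% of rows: the optimizer assumes 1%
# survives, which goes badly wrong when the columns are correlated.
assert estimate_rows(1_000_000, [0.1, 0.1]) == 10_000.0
assert join_rows(1_000_000, 10_000, ndv_r=50_000, ndv_s=10_000) == 200_000.0
```

The comment in the example is exactly the uncertainty the text describes: when the data violates the independence assumption, the estimate, and hence the chosen plan, can be far from optimal.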
5.2) Hybrid Query Execution Framework
A common belief is that the traditional volcano model cannot meet the high-throughput performance requirements of analytics scenarios. As systems have evolved, two successor execution frameworks have emerged for computing engines in the industry:
- Just-in-time (JIT) Compilation: JIT compilation is data-centric. After the previous operator processes a piece of data, the next operator will compute it directly in the CPU cache. JIT compilation is cache-friendly and suitable for computing-intensive tasks.
- Vectorization: In vectorization, an operator processes a batch of data and then sends the results to the next operator for computing. Vectorization is suitable for memory-intensive tasks and vector computing and achieves high cohesion of operator computing at the cost of overhead for materializing intermediate results.
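The vectorized style can be sketched as operators exchanging batches instead of single rows. This is a generic illustration of the model, not AnalyticDB's executor; note how each operator is invoked once per batch, so per-call overhead is amortized over many rows.

```python
def scan(batches):
    """Source operator: emits column batches (here, prebuilt dicts)."""
    yield from batches

def filter_gt(batches, col, threshold):
    """Filter operator: one call per batch, not one call per row."""
    for batch in batches:
        keep = [i for i, v in enumerate(batch[col]) if v > threshold]
        yield {c: [vals[i] for i in keep] for c, vals in batch.items()}

def agg_sum(batches, col):
    """Sink operator: aggregates over all surviving batches."""
    return sum(sum(batch[col]) for batch in batches)

data = [{"price": [5, 15, 25]}, {"price": [35, 1]}]
total = agg_sum(filter_gt(scan(data), "price", 10), "price")
assert total == 75  # 15 + 25 + 35
```

The cost noted in the text is visible here too: `filter_gt` materializes an intermediate batch before handing it on, which is the price vectorization pays for its batch-at-a-time efficiency.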
In the preceding figure, the red column indicates JIT compilation, while the green column indicates vectorization. AnalyticDB for MySQL is the only proprietary analytics engine that supports both execution modes. The hybrid execution framework adaptively fuses multiple computing-intensive operators into a single driver based on the vectorization mode, yielding a query execution engine that combines compilation with vectorization.
5.3) Unified Memory Management
Efficient memory management is the cornerstone of computing optimization. In a type-oriented memory model, a different backing store is used for each data type. As a result, data of different types cannot be placed in contiguous memory addresses and can only be stored column by column, incurring the extra cost of many separate memory objects. In addition, memory cannot be reused across types, which adds further memory management cost.
The query execution engine of AnalyticDB for MySQL solves the preceding problems through unified memory management.
- Binary memory unifies the memory type to store different types of data in the same type (byte). This is the basis for row-oriented storage and cache-friendly algorithm optimization to facilitate query execution.
- Standardized memory management specifications unify the memory specifications to reduce the extra costs due to memory fragmentation and the difficulty of memory reuse.
- Hierarchical memory management unifies memory management, maps memory lifecycles based on computing characteristics, implements MemoryCache and MemoryPool based on memory usage characteristics, and supports memory leak detection. This implements active memory management oriented to resident services.
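The page-reuse idea behind a MemoryPool can be shown with a toy byte pool. This is a simplified sketch under the assumption of fixed-size pages; the class and its page size are illustrative, not AnalyticDB internals.

```python
class BytePool:
    """Toy unified pool: fixed-size byte pages, recycled across operators."""
    PAGE_SIZE = 4096

    def __init__(self):
        self.free_pages = []
        self.allocated = 0

    def acquire(self) -> bytearray:
        """Hand out a page, preferring a recycled one over a fresh allocation."""
        self.allocated += 1
        if self.free_pages:
            return self.free_pages.pop()
        return bytearray(self.PAGE_SIZE)

    def release(self, page: bytearray):
        """Return a page to the pool; any operator can reuse it next."""
        self.allocated -= 1
        self.free_pages.append(page)

pool = BytePool()
a = pool.acquire()
pool.release(a)
b = pool.acquire()            # recycled page: no new allocation happened
assert b is a and pool.allocated == 1
```

Because every page is plain bytes, an operator that releases a page used for integers can hand it straight to one that needs it for strings, which is exactly the cross-type reuse a type-oriented model forbids. Tracking `allocated` also gives a natural hook for leak detection.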
5.4) DFP and CTE Technologies
Join operations between a fact table and a dimension table are typical in a data warehouse, where the ratio of the two tables' data volumes can reach tens of millions to one. The computing cost of such joins arises mainly from data scanning, so we use dynamic filter pushdown to greatly reduce the amount of data scanned from the left (probe) table. In addition, data warehouse workloads contain a large number of WITH statements and implicitly shared subqueries, where common table expressions (CTEs) can be shared to avoid repeated computation.
In a Join with a high filtering rate (low hit rate), most of the probe-side data read from storage is discarded. With dynamic filter pushdown (DFP), if the build side is estimated to stay below a small threshold, we use the build result as a filter condition (dynamic filter) on the left (probe) table and push it down to storage, reducing the amount of data scanned. The optimizer's main task here is to properly estimate the number of distinct values (NDV) on the build side that hit the Join condition.
Different Join orders directly affect the range and granularity of dynamic filters. The cost of a Join operation that can be optimized is significantly different from that of a real Hash join, which affects the Join orders. Based on the complete and scalable CBO framework of AnalyticDB for MySQL, we selected the globally optimal dynamic filter solution based on cost considerations.
We implement an efficient DFP at the following three key points:
- Efficient Building of Dynamic Predicates: Dynamic predicates are built in place during processing, which reduces the cost of constructing them.
- Multi-layer Filtering Optimization: Bloom Filter is used to prune partitions and perceive storage indexes, which accelerates filtering.
- Heterogeneous Data Source Push-down: A unified data source interface layer is abstracted so that dynamic filters can be pushed down to more heterogeneous data sources.
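A common way to realize a pushed-down dynamic filter is a Bloom filter built from the small (build) side, which is what the multi-layer filtering bullet refers to. The sketch below is a minimal generic Bloom filter, not AnalyticDB's implementation; key names and sizes are illustrative. Note that a Bloom filter never produces false negatives, but may let some non-matching rows through (false positives), which the real join later removes.

```python
import zlib

class BloomFilter:
    """Minimal Bloom filter usable as a pushed-down dynamic predicate."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # integer used as a bit array

    def _positions(self, key):
        for seed in range(self.hashes):
            yield zlib.crc32(f"{seed}:{key}".encode()) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

# Build side (small dimension table) populates the filter ...
bf = BloomFilter()
for key in ["store_1", "store_7"]:
    bf.add(key)

# ... and the scan of the big fact table drops non-matching rows early.
fact_keys = ["store_1", "store_2", "store_7", "store_9"]
survivors = [k for k in fact_keys if bf.might_contain(k)]
assert {"store_1", "store_7"}.issubset(survivors)  # no false negatives
```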
As for CTEs, more than 30% of the SQL statements in TPC-DS contain a "with as" subquery. Such a subquery may be referenced multiple times in the main query, and each reference causes repeated computation and wastes resources. Basic CTE optimization reuses the result of the "with" clause across references to cut this cost. In some scenarios, however, deriving predicates from the main query can further reduce the amount of computation inside the "with" subquery, and directly sharing the complete "with" clause would cause a performance regression. Instead, the optimal plan generated after inlining can be analyzed to identify common subtrees, which further reduces repeated computation while avoiding such bad cases. In the executor, we introduced deadlock detection that resolves deadlocks by analyzing the dependencies among the multiple consumers of a common subtree.
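The basic reuse step can be sketched as a compute-once cache keyed by the shared subtree. This toy model (the names `CTECache` and `expensive_with_clause` are hypothetical) only illustrates result sharing; the inlining, predicate derivation, and deadlock detection described above are beyond its scope.

```python
call_count = {"with_clause": 0}

def expensive_with_clause():
    """Stands in for an expensive `WITH ... AS (...)` subquery."""
    call_count["with_clause"] += 1
    return [("item_1", 100), ("item_2", 200)]

class CTECache:
    """Compute a shared subtree once; later references reuse the result."""

    def __init__(self):
        self._results = {}

    def get(self, name, compute):
        if name not in self._results:
            self._results[name] = compute()  # first reference: compute
        return self._results[name]           # later references: reuse

cache = CTECache()
first = cache.get("cte", expensive_with_clause)   # computed
second = cache.get("cte", expensive_with_clause)  # reused, no recompute
assert first is second and call_count["with_clause"] == 1
```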
6) Customer Scenarios and Cases
Since its debut in the cloud in 2014, AnalyticDB has been available in major regions around the world and has entered the global analytics market. In 2018, AnalyticDB was selected as a contender in Forrester's research report "The Forrester Wave™: Cloud Data Warehouse, Q4 2018" and in Gartner's "Magic Quadrant for Data Management Solutions for Analytics" report.
AnalyticDB is widely used in the Alibaba Group and by our customers to replace Teradata and Oracle RAC. It has been applied and verified on a large scale in core industries such as the internet, new retail, finance, taxation, transportation, meteorology, and logistics. For example, in the logistics industry, AnalyticDB has enabled China Post to perform large-scale centralized query and analysis on nationwide logistics data for the first time.
7) Summary and Outlook
AnalyticDB has been verified academically by the Very Large Data Bases (VLDB) paper "AnalyticDB: Real-time OLAP Database System at Alibaba Cloud", an honor reserved for top large-scale commercial systems such as Google F1 (VLDB 2013) and AWS Aurora (SIGMOD 2017). AnalyticDB has also proven itself through the TPC-DS benchmark (establishing itself as the world leader in cost-effectiveness and performance), through customer feedback (including major government ministries and internet-scale customers), and through years of successful application within the Alibaba Group. We have now upgraded AnalyticDB to a cloud-native data warehouse built on highly efficient cloud computing and the integration of databases and big data.
In the future, big data and database integration and cloud-native will redefine the data warehouse for the cloud computing era. Breaking the TPC-DS world record is just a start. AnalyticDB will continue to strive to become an infrastructure for digital transformation and upgrade and a way for enterprises to realize the value of online data.