For most business production environments, complex SQL queries is a challenging yet necessary component. In scenarios such as statistical report analysis, Service application systems perform complex SQL queries in the backend or similar systems after storing data using the OLTP database. However, these queries are time consuming and resource intensive especially when working with large datasets.
To satisfy the quick response requirements for complex SQL queries, the Alibaba Cloud Distributed Relational Database Service (DRDS) team introduced the self-developed MPP multi-machine concurrent computing engine (Fireworks) and corresponding optimization policies based on the third-generation distributed SQL engine, greatly enhancing the complex query processing capability of DRDS.
The engine can return results in seconds for distributed multi-table join, aggregate, sort, and sub-query operations in databases with tens of millions of data entries, which greatly improves the response speed. In addition, the engine uses the same RDS read-only data for processing, eliminating the necessity of data synchronization to other data sources. This helps reduce the overall link complexity of the service architecture and lower business operation and maintenance costs and budgets.
Main Features of DRDS
Self-Developed MPP Multi-Machine Concurrent Computing Engine ¨CFireworks
The DRDS read-only instance hosts an SQL execution engine (Fireworks) that features complete multi-machine concurrent processing capabilities. The engine is significantly different from the SQL execution engine hosted on the DRDS primary instance.
The execution engine of the DRDS primary instance adopts a stand-alone architecture, which pushes as many computing tasks to the physical sub-databases on the bottom layer as possible. It realizes logically distributed SQL computing based on the computing capabilities of physical sub-databases.
The Fireworks engine hosted on the DRDS read-only instance is a cluster consisting of multiple computing nodes. It converts an SQL query into a distributed computing task. This eliminates the computing capability limits of physical sub-databases, greatly improving the complex logical SQL computing speed, especially for computing tasks such as join, aggregate, and sort.
The Fireworks engine shuffles and breaks up a join, aggregate, or sort computing task and distributes the sub-tasks to the nodes in the computing cluster, thus achieving computing speedup through multi-node concurrent computing.
An Optimizer Customized for the Multi-Machine Concurrent Execution Mode
The original optimizer for the DRDS primary instance focuses on OLTP scenarios. Its core concept is to push all computing tasks to its physical sub-databases for execution if possible. Its purpose is to fully utilize the computing resources of physical databases and avoid massive data flow for quick response.
The overall performance is restricted by the physical databases in complex query scenarios with large data volume. Its stability is also affected due to the high pressure on the physical databases. In summary, the overall OLAP capability has many limitations.
The introduction of the MPP multi-machine concurrent computing engine (Fireworks) greatly improves the computing capability of DRDS itself. The overall optimization policies are also adjusted for the optimizer:
- Execute as many complex computing tasks (such as join, aggregate, and sort tasks) as possible on the execution engine of the DRDS itself, realizing computing speedup and scalability through the Fireworks computing cluster.
- Push lightweight computing tasks (such as project and filter tasks) to the physical database to reduce data pulling costs.
The DRDS distributed SQL optimizer optimizes the execution scheduler in the finest granule into one friendly to the multi-machine concurrent execution engine for higher execution efficiency.
In addition, the refined operator pushing policy is provided. Operators with less pressure on the RDS are pushed to the physical databases for higher computing cost effectiveness. In this way, the RDS is protected from the impact of high-cost operators, ensuring online traffic stability.
Direct Analysis based on Online Data
Emerging Internet services, represented by the new retail business, are springing up. The industry features not only real-time OLTP requirements, but also complex quasi-real time OLAP requirements, to support real-time decision making.
At present, in most solutions for data analysis scenarios, the production data is exported from the OLTP database to other data sources for offline analysis. This traditional mechanism can hardly satisfy the quasi-real-time requirements, and meanwhile entails data loss risks during data export to the offline system.
Rather then executing cumbersome data synchronization tasks, the DRDS read-only instance directly performs complex data processing based on the RDS read-only instance or RDS primary instance, which greatly lowers down the overall link complexity of the service architecture and reduces business operation and maintenance costs and budgets.
In addition, in avoiding data synchronization, the DRDS read-only instance ensures data processing timeliness, realizing the real-time level as high as READ COMMITED (based on the RDS primary instance).
SQL Compatibility with Clear Boundaries
The DRDS read-only instance is comprehensively compatible with the SQL query syntax of the DRDS primary instance and is highly consistent with DRDS V 5.3 in terms of SQL compatibility and SQL supporting boundaries.
Compared with similar products, it has high compatibility and clear boundary support, providing an experience almost identical to that of the DRDS primary instance.
Complex SQL queries that are executed slowly or cannot be executed on the DRDS primary instance can be directly migrated to the read-only instance for execution, eliminating the additional overhead of SQL rewriting.
Flexible Product Experience
The DRDS read-only instance automatically synchronizes the account privilege information of the DRDS primary instance and supports the native VPC. You can enable it on both the Intranet or Internet and adjust the configuration flexible based on the business situation. It will foster linear improvement on data processing capabilities.
Technical Architecture Overview
The overall architecture of the DRDS read-only instance is basically the same as that of the DRDS primary instance. Only the query layer is changed, being added with the MPP execution engine and the corresponding optimizer, as shown below:
The DRDS protocol layer processes network interactions and parses the MySQL protocol. After receiving a query request, the DRDS protocol layer forwards the SQL query to the query layer for processing. The query layer parses the SQL query. The executor generates an optimized execution scheduler, which is then passed by the execution engine to the storage layer for query and computing.
If the Fireworks engine is required for computing, the query layer further converts the execution scheduler to a distributed execution scheduler and submits it to the Fireworks cluster as a distributed task after obtaining the execution scheduler. The remote Fireworks cluster completes the data query on the storage layer and the follow-up computing work.
In short, the DRDS read-only instance may be considered as the original DRDS primary instance with a execution link with multi-machine concurrent processing capabilities.
In general, the DRDS read-only instance is applicable to complex queries in high data volume levels featuring low concurrency and high latency, such as data analysis and report scenarios, typical for its large number of join, aggregate, and sort operations performed on a large volume of data.
At present, the DRDS read-only instance has been running for several businesses in Alibaba Group, represented by new retails such as Hema and ET Brain. The DRDS read-only instance can perform correlation analysis in multiple dimensions, such as people, goods, fields, and warehouses, and join, aggregate, and sort several or even dozens of logical tables scattered in different logical databases for business needs such as inventory reconciliation and decision making.
The DRDS read-only instance eliminates the necessity of configuring and maintaining a large number of data synchronization links and solves the issues of poor timeliness or inaccuracy of results due to data out of synchronization for business development personnel. It lightens work loads for them.
For users who are already using the DRDS, the DRDS read-only instance can address the following known issues:
- Some complex SQL queries involving join, aggregate, and sort operations may require secondary computing in the DRDS execution engine because they cannot be completely pushed to the physical databases, and yet such computing cannot be executed due to the memory limitation of the stand-alone execution engine.
- Complex computing parts of an SQL query can be pushed to physical databases. However, if the query involves a large volume of data, due to the high pressure on physical databases, the OLTP service is affected, or the response time is too slow to meet the requirements.
For a long time, the DRDS has been unable to provide good support for complex queries involving large data scales or implement linear scaling upon its local computing capabilities through physical resource expansion due to the limitation of the execution engine in stand-alone architecture.
The introduction of the DRDS read-only instance completely compensates for the shortcomings of the DRDS in OLAP scenarios, enabling the DRDS to provide scalable OLAP capabilities while providing strong OLTP capabilities. It provides a convenient one-stop overall solution for users with both OLTP requirements and mid-range data analysis needs.
In the following six months, the DRDS read-only instance will be launched with the cross¨Clogical database correlation query function. More technological measures will be taken to continuously enhance its core capabilities, including the concurrency, response speed, data volume, interactive query, etc., so as to meet the enterprise applications’ stringent requirements for databases.
To learn more about Alibaba Cloud Distributed Relational Database Service (DRDS), visit www.alibabacloud.com/product/drds