By Sigan, senior database expert from Alibaba Cloud ApsaraDB
Databases play an increasingly important role in our business systems. Like other companies, most Alibaba services rely heavily on databases. A minor database error may cause a significant impact on our business. Stabilizing and constantly optimizing the databases have always been crucially important.
Database optimization is usually implemented in three levels:
1) Application level: Optimizes application code logic to process data more efficiently.
2) Instance level: Adjusts environment parameters to improve the operating efficiency of instances.
3) SQL level: Ensures data is optimally retrieved through physical database design and SQL statement modification.
Developers are usually familiar with the first two levels and somewhat unfamiliar with SQL optimization, which may even lead to disputes over who should be responsible (the database administrator or application developer). However, SQL optimization is critical to database optimization. Poor SQL performance may lead to low execution efficiency or even interrupted services.
Based on experience, about 80% of database performance problems can be solved through SQL optimization. However, SQL optimization is a very complex process that requires a wide range of database expertise and experience.
For example, it requires knowledge about how to accurately identify bottlenecks in an execution plan and optimize the execution plan by improving the physical database design or rewriting SQL statements. In addition, SQL optimization is time-consuming and costly due to the large and constantly changing SQL workload and basic data.
SQL diagnosis and optimization is one of the most important services of Alibaba Cloud Database Autonomy Service (DAS). DAS performs diagnostic analysis by using SQL statements as the input and provides expert optimization suggestions such as indexing, statement optimization, and expected benefits. You can obtain expert suggestions on SQL diagnosis, rewriting, and optimization to maximize SQL execution performance, without profound knowledge about database optimization.
In addition, based on this capability, the DAS auto-SQL optimization service has pushed SQL optimization to a higher level. The service transforms passive optimization from relying on manual work into intelligence-based active optimization, enabling unattended SQL optimization through an autonomous process.
Next, we will explain the capability building process for the DAS SQL diagnosis and optimization service in detail.
When it comes to the diagnosis and optimization capabilities, we naturally think of two questions:
- 1) Are the capabilities reliable?
- 2) Are the capabilities comprehensive?
Answering these two questions perfectly poses several challenges:
Challenge 1 — How can we select a reliable optimization recommendation algorithm to generate reliable suggestions?
In the field of SQL diagnosis and optimization, two types of optimization algorithms are commonly used: the rule-based and the cost model-based algorithms. Because of its simplicity and ease of implementation, rule-based recommendation is widely used in many products and services nowadays. It is perfect for databases that lack the WHAT-IF kernel capability, for example, MySQL databases. However, this recommendation algorithm is inflexible and thus cannot ensure the recommendation quality. Let’s have a look at the example about recommending indexes for the following SQL query:
WHERE time_created >= '2017-11-25'
AND consuming_time > 1000
ORDER BY consuming_time DESC
In most cases, the following four candidate indexes are generated based on a specified rule:
Which index (or indexes, depending on the AND or OR relationships between indexes) is (or are) finally recommended to users? It is difficult to give an exact answer. This example only uses a simple single-table SQL query. What should we do with complex SQL queries (for example join-table queries) and queries with complex subqueries? The recommendation accuracy is even lower.
To solve this problem, the DAS SQL diagnosis and optimization service is implemented based on cost models, in the same way as the database optimizer. It quantitatively evaluates all possible candidate recommendations (or at least most of them in some extreme cases, where a suboptimal solution is obtained for NP optimization problems) and finally makes recommendations based on the execution costs. Using this algorithm, we find ourselves faced with other challenges when providing support for open-source databases such as MySQL.
- Lack of the WHAT-IF kernel capability: We cannot use the kernel’s database optimizer to quantitatively evaluate the costs of the candidate optimization plans.
- Lack of statistical information: Evaluating candidate optimization plan cost is essentially calculating the execution plan cost, which cannot be done without statistical information.
Challenge 2 — How can we ensure sufficient SQL compatibility?
The method that the SQL diagnosis and optimization service uses to ensure SQL compatibility (which involves SQL parsing and SQL semantic verification) directly affects the comprehensiveness of the service and the success rate of diagnosis. Either incompatibility or insufficient compatibility is an issue.
Challenge 3 — How can we build a capability-based test set with sufficient coverage?
Building SQL diagnosis and optimization capabilities have long been a challenging topic. The challenges lie not only in how to incorporate expert knowledge about database optimization but also how to build a comprehensive test-case library to verify the core capabilities. The test-case library is like a ruler that measures the capabilities and the driving force that accelerates the capacity building process. Therefore, an accurate test-case library with sufficient coverage is essential for capability building throughout the process.
However, building a desirable test-case library is very difficult. The two main challenges are as follows:
(1) Many factors affect SQL optimization. For example, hundreds of factors affect index selection, and the factors can form various combinations. As a result, we have an enormous feature set for test cases. It also requires significant effort to map these features to test cases.
(2) Professional knowledge and massive amounts of information are required for designing test cases, even for a single test case. For example, an index recommendation test case involves:
Challenge 4 — How can we build diagnosis and optimization capabilities for a large number of SQL statements?
The SQL diagnosis and optimization service is expected to serve millions of database instances in the cloud. Its online service capabilities also face great challenges, such as service-oriented division of complex computing services, horizontal scaling of computing services, maximum parallelism, concurrency control in a distributed resource access environment, effective scheduling based on different priorities, and peak traffic buffering.
2) Capability Building
To address the aforementioned challenges, we built an SQL diagnosis and optimization engine and a capability test set for SQL diagnosis optimization. This section describes the core technical architecture of this engine, and how we built the test set.
2.1 Core Technical Architecture
Figure 1 shows the core technical architecture of the SQL diagnosis and optimization engine, which consists of optimizers independent of the database. This engine adaptively collects statistics and calculates the costs of the execution plans, to address the lack of the WHAT-IF kernel capability and statistical information. The engine works in the following process:
- SQL parsing and verification: The engine parses query statements, verifies whether they meet the requirements, identifies the components of the query statements to form a syntax tree comprising predicates, predicate types, sorting fields, aggregation fields, and query fields, and determines the data types of fields in query statements. The engine also verifies whether the tables and fields used in SQL statements comply with the schema design of the target database.
- Candidate index generation: The engine generates multiple candidate index combinations based on the syntax tree formed during SQL parsing and verification.
- Cost evaluation: The engine uses a built-in optimizer independent of the database kernel to retrieve database statistics and stores the statistics in the engine’s cache. The built-in optimizer calculates the costs, based on statistics. It evaluates the cost of each index and SQL rewriting method, then selects the optimal index or SQL rewriting method based on the costs.
- Index merging and selection: The input of the engine can be one or more query statements or all query statements of the entire database instance. During index recommendation for multiple query statements, indexes recommended for different query statements may be identical or similar to each other or have the same prefixes. In some cases, physical indexes identical or similar to the recommended ones may already exist or share the same prefixes with them.
2.2 Building a Capability-based Test Set
As mentioned earlier, our objective is to build a capability-based test set with sufficient coverage. We would use it as a ruler to measure the capabilities, and a driving force to accelerate the capability building process. In this process, we have built a development model centered on the test case system, as shown in Figure 2.
The basic idea for building the capability-based test set is to formalize the feature-based description of test cases through feature extraction, to build a sufficiently complete, formal feature-based test-case library.
Within Alibaba Group, we have been collecting and storing all SQL statements executed in Alibaba’s internal database instances in real-time. Based on Alibaba’s diversified services and SQL scenarios and formalized feature descriptions, we have analyzed all our online SQL resources and searched for real cases that match the specific features, extract the required information, and finally build a test case library. Note that all data in the test case library was collected from Alibaba Group’s internal business systems. All online data such as statistical information has been encrypted and desensitized. This process is fully automated.
Finally, we compare the formal feature-based test-case library with features of the actual test-case library to evaluate the test completeness and coverage. This step answers the following questions:
1) Which formal feature-based test cases are covered by the actual test-case library? What is the percentage?
2) Which formal feature-based test cases are not covered by the current diagnosis and optimization capabilities? Which test cases fail the verification?
3) Which formal feature-based test cases encounter frequent regression problems in a specific period of time?
4) What is the test case coverage of each capability?
Before the official release of DAS, the DAS SQL diagnosis and optimization service had been running stably in Alibaba Group for more than three years, diagnosing about 50,000 SQL statements on average every day. It effectively optimizes SQL statements for Alibaba Group’s business applications and mainly applies to the following scenarios:
1) Self-service optimization: The service diagnoses exceptions reported by users and provides expert optimization suggestions.
2) Auto-optimization: The service automatically identifies slow queries in the workload of business database instances, proactively diagnoses these queries, and generates optimization suggestions. After a comprehensive evaluation, it schedules optimization tasks. Then it implements the optimization task and tracks the effectiveness of the solution. This is a fully automated, closed-loop optimization process that improves database performance and continuously keeps the database instances in the optimal state.
Over the past three years, the SQL diagnosis accuracy remained above 98%, and more than 75% of the diagnosed slow queries were provided with optimization suggestions.
By the end of March 2020, more than 42 million slow SQL queries had been optimized through auto-SQL optimization, reducing Alibaba Group’s slow SQL queries by about 92%.
More importantly, the SQL diagnosis and optimization service has built an effective and proactive analysis and feedback system. Online diagnosis failures, user-reported cases, and cases rolled back during auto-optimization have been automatically sent to the case system to drive rapid iteration and evolution of the service.
We will post a series of articles on a weekly basis to introduce features of Alibaba Cloud DAS in detail, such as AutoScale, exception detection, auto-SQL optimization, workload-based SQL review, intelligent stress testing, and intelligent parameter tuning. Please stay tuned.
SQL diagnosis and optimization is free if you have an Alibaba Cloud DAS instance. Visit the Database Autonomy Service page to apply for a free trial of DAS.