Database administrators and application developers have all experienced SQL optimization. SQL statements running in databases are varied and continuous. Dynamic changes make SQL optimization necessary. Such changes include fast business iteration, change of data distribution characteristics, hot-data changes, and database version upgrades. For these changes, similar processes trigger different challenges each time.
So, How Can We Use Integrated Methods to Quickly and Accurately Locate Problems?
It is not enough to analyze slow query logs for abnormal SQL queries. How do we accurately identify bottlenecks based on database experts’ experience or tools and obtain rectification and optimization recommendations? How do we comprehensively evaluate the optimization effect and impact (including the adverse impact), evaluate the impact on related SQL, and write operations to ensure pre-release security? How do we select phased release policies and change windows for complex deployments, such as large-scale database sharding or table sharding, to promote online changes securely and stably? How do we continuously track the effects to ensure that everything is problem-free?
We need to consider two important time points (as shown in the following figure.) In a simple slow SQL query trend, T1 represents the time point when we find out the performance of the database instance is abnormal. From this point on, we start to optimize slow SQL queries. T2 represents the time point when the optimization is completed and the instance restores. In traditional optimization, this process is manually driven and usually has the following shortcomings:
- T1 is too late. Exception discovery and response are not in time. When exceptions are found, a large number of issues may be accumulated and cause fault risks.
- The interval between T1 and T2 is too long, which greatly affects the user experience and dramatically increases fault risks.
In addition to the preceding two issues, we are faced with two more serious challenges:
- How do we implement continuous optimization to detect and mitigate issues promptly to prevent a large number of problems from being accumulated and ensure that database instances always run stably and optimally?
- How can we shorten the processing duration to minimize the impact and use integrated methods to ensure database instance stability and address both symptoms and root causes?
Traditional methods are manually driven and have inherent limitations in the preceding two challenges. As a result, developers are often driven by faults and busy addressing faults. With large-scale business development and instance growth, these issues will be enlarged and probably cannot be resolved if more labor is invested.
Automatic SQL optimization is one of the core services of Alibaba Cloud Database Autonomy Service (DAS). It uses autonomous self-optimization to ensure SQL optimization closure.
As shown in the preceding figure, its closure capabilities include:
1) Detects abnormal workloads, identifies database business changes, and quickly identifies and locates abnormal SQL queries. For example, SQL exceptions include additional slow SQL queries, SQL queries that cause performance deterioration, and inefficient SQL queries.
2) Calls the SQL diagnosis and optimization service for abnormal SQL queries to generate optimization recommendations, for example, optimal indexing, SQL statement rewriting, and engine recommendations.
3) Evaluates optimization recommendation risks, generates a phase-release plan, and orchestrates optimization tasks based on database instance workloads and profiles.
4) Selects O&M windows and makes online changes based on the phased-release plan. In the current phase, indexes can be automatically changed online.
5) Tracks the multidimensional optimization effect for online changes and continuously and comprehensively evaluates performance regression risks. If the optimization meets the expectation, optimization benefit is automatically calculated. If the optimization fails to meet the expectation, it is automatically rolled back.
This closed automatic optimization loop converts manual and passive optimization to intelligent, proactive, and continuous optimization, ensuring unattended SQL optimization. With this closed-loop, you have a group of database experts who ensure that your database system always runs optimally 24/7.
To ensure unattended SQL optimization, we must address multiple challenges:
Accuracy: An exception detection mechanism is required to accurately identify the optimization time and locate abnormal SQL queries.
Professionalism: Professional and effective optimization diagnosis is required.
Security: The security of online changes must be controllable.
Comprehensiveness: The optimization effect needs to be monitored comprehensively from multiple dimensions and evaluated promptly to ensure security.
Association: Complex online problems sometimes require integrated management. For example, automatic SQL throttling and SQL optimization of DAS needs to be associated with sudden, malignant, slow SQL queries to solve both symptoms and root causes.
Scale: A scalable service architecture needs to be built to support large-scale automatic optimization for hundreds, thousands, or millions of SQL queries. The following interprets the DAS solution from multiple dimensions.
DAS automatic SQL optimization is a data-driven closed-loop. The preceding figure briefly illustrates the whole process.
- Exception events: Exception events trigger automatic SQL optimization and are centrally managed by the DAS event center. Exception events result from real-time exception detection, offline analysis, workload detection, and the alerting system.
- Diagnosis: After receiving an exception event from the event center, the automatic SQL optimization service preliminarily checks the instance, sends a diagnosis request to the diagnosis engine, and processes the diagnosis result, which can be one or more recommendations. After evaluating the effectiveness, the automatic SQL optimization service generates and sends new optimization events to the event center to trigger the next optimization process.
- Recommendation push: You can enter the DAS “autonomy center”, determine whether to accept the optimization recommendations if the full autonomy mode is disabled, and trigger the next automatic optimization process.
- Change release: The DAS automatically selects an O&M window period, delivers the change command, and confirms the execution result.
- Effect tracking and assessment: After the optimization recommendations take effect, the decision engine will start a tracking task to track the performance of optimized and related SQL queries. If the performance deteriorates, the optimization is automatically rolled back. Typically, if rollback is not performed 24 hours after the tracking, the optimization benefit is calculated.
2. Issue Discovery
SQL optimization can discover SQL exceptions in various scenarios:
- Timed triggering
The automatic SQL optimization service periodically analyzes slow SQL queries of instances offline in the O&M window period and initiates SQL optimization.
- Triggered when the performance of some SQL queries deteriorates
The workload exception detection algorithm will detect SQL queries with deteriorating performance promptly and trigger automatic SQL optimization. For complex online issues, automatic SQL optimization and automatic SQL throttling of DAS will be associated to initiate automatic SQL optimization.
- Triggered when the instance workload changes
With online and offline business SQL queries, the database workload and data amount change. Existing indexes cannot match business performance requirements or initiate instance workload diagnosis and optimization.
3. Diagnosis Capability
The DAS SQL diagnosis and optimization service provide strong support for automatic SQL optimization. It uses the cost-based model to formulate optimization measures, which is the same as the database optimizer. It quantifies all possible recommendation options based on the execution cost and makes reliable recommendations. The service has run stably in Alibaba for over three years with a daily diagnosis volume of about 50,000. It also supports SQL optimization for the business applications of the group. In the past three years, the SQL diagnosis success rate remained over 98%, and the recommendation rate for slow SQL queries was more than 75%.
4. Secure Changes
To ensure secure changes, security check before changes, phased change policy, and performance tracking after changes are required.
Security check: To reduce risks, changes only occur in the O&M window period. We will also check the latency for data synchronization between the primary and secondary databases, instance workload, and table space. Changes are made only if these indicators are within the secure scope.
Phased change policy: In a large-scale database or table sharding scenarios, a phased change plan is automatically generated to reduce risks. During the change process, the system will monitor the latency for data synchronization between the primary and secondary instances. If the latency exceeds the threshold, all index change tasks of the database will be stopped, and only one change task can be executed in each database.
Effect evaluation: The effect evaluation algorithm tracks the performance of optimized SQL queries and related SQL templates to prevent faults due to performance deterioration. Based on the decision tree model, the performance tracking algorithm compares the performance indicators of the SQL template before and after optimization to comprehensively determine whether the SQL template has experienced performance deterioration at that time. Business is often changed by day and the default tracking time is 24 hours. If rollback is not performed, optimization is successful and the optimization benefit is calculated.
Tested Over Time
The DAS automatic SQL optimization service ran stably without faults in Alibaba for about two years before being released in the cloud. As of April 2020, the service has optimized more than 42 million slow SQL queries, and the slow SQL query rate of the group network dropped by about 92%. More importantly, the automatic SQL optimization service has built an effective proactive analysis and feedback system. Online failures and rollback examples during automatic optimization are automatically accumulated to the example system. The service continuously drives automatic optimization closure and fast iteration of the diagnosis service.
Where Can I Find It?
Alibaba Cloud DAS has recently launched this new feature. It is now available for a free trial by clicking here.