Solving Database Crashes Due to Service Exceptions — Auto-SQL Throttling with Alibaba Cloud DAS
By Yunji, senior database R&D engineer from Alibaba Cloud ApsaraDB
In this article, we discuss how Alibaba Cloud Database Autonomy Service (DAS) provides auto-SQL throttling to handle service exceptions and database crashes effectively with quick turnaround time.
As the technology matures, especially with the popularization of cloud databases, database systems are becoming increasingly stable and easier to maintain. Many tasks such as version upgrades and instance migration are now completed automatically without affecting upper-layer services. Even if a hardware device or the network is abnormal, the preventive maintenance inspection (PMI) system immediately migrates data and restarts services to ensure service stability.
However, almost all the existing automated solutions focus on ensuring server-end stability. Issues caused by service-end exceptions, such as slow SQL statements caused by service changes and traffic bursts, still require human intervention. These solutions are unable to rapidly handle service exceptions and prevent systems from crashing.
2.1 Traffic Issues — Sudden Traffic Bursts
There is a dramatic increase in the number of SQL statements executed in parallel, for example, in the case of cache penetration, abnormal calls, and major promotions.
2.2 Data Issues — SQL Statements on Skewed Data
For example, SQL statements querying for information about an account that involves a large number of transactions in the order data tend to slow down the database.
2.3 SQL Issues — Poor SQL Statements
If a large number of SQL statements are executed on new database tables that are not indexed, the system will slow down and normal services will be affected.
Conventional solutions such as killing problematic SQL statements, restarting the database, or switching to a secondary database cannot solve these issues.
Database administrators (DBAs) are not usually able to do anything other than watch the database crash or call developers for help.
If you are familiar with the Alibaba Cloud products, you probably know that ApsaraDB for RDS for MySQL 5.6 and later versions effectively solves these issues through SQL throttling.
SQL throttling specifies the maximum degree of parallelism for executing any SQL statement in the database. By throttling the degree of parallelism for abnormal SQL statements, the database responds to service requests effectively while ensuring most of the services operate normally. In other words, the normal operation of the majority of services is guaranteed by compromising a minority of them.
This feature has been put into use for nearly three years. Many users have the following questions while using it:
1) How do I detect an exception immediately when it occurs?
2) How do I identify SQL statements that need to be throttled when an exception occurs?
3) How do I extract keywords for SQL statements that need to be throttled to minimize the impact on services when facilitating service recovery?
4) How do I quickly verify whether a throttling operation has been executed properly?
In addition to these questions, you may face some other problems. For example, when an emergency occurs, your DBA cannot be reached or does not have immediate access to a computer; there is too much information to analyze, or operation errors are introduced during troubleshooting.
Therefore, you need to automate the whole process including detecting exceptions, locating abnormal SQL statements, throttling SQL statements, and tracking and rolling back operations.
The automated SQL throttling solution was then developed and has been running on Alibaba Group’s internal databases for more than two years. It was officially made available to the public as part of an Alibaba Cloud product in February 2020. To try this feature, go to the Alibaba Cloud Database Autonomy Service (DAS) product page.
3.1 Overall Process
The following diagram shows the overall implementation process:
3.2 Performance Metric Collection
This module is enabled by default for hosts and engines of ApsaraDB Relational Database Service (RDS) instances. These performance metrics include CPU usage, input/output operations per second (IOPS), queries per second (QPS), and active sessions. These real-time metrics serve as the basis for all subsequent analysis and actions.
3.3 Exception Detection
This module uses historical performance data of a database instance to train a machine learning model offline. Then it uses this model to detect exceptions based on real-time performance metrics. This method detects exceptions much sooner, before receiving the threshold-based alerts. This module will be described in detail later in another article.
3.4 Root Cause Analysis
This module subscribes to exception events of a database instance and collects session information the moment when an exception occurs. Then it analyzes the collected information together with statistics of all the SQL statements in SQL performance_schema (generated through SQL audit) to find the root cause of the exception.
We categorize common root causes into four types:
- Blocking SQL Statements: DAS analyzes database instance information, such as real-time sessions, lock waits, and ongoing transactions, to determine whether DDL changes, large transactions, or lock waits exist. Then DAS calculates the number of affected sessions and the execution time of these sessions. If the exception affects a relatively large number of sessions, or the affected sessions require a rather long time to execute, DAS kills abnormal sessions without throttling.
- Resource-consuming SQL Statements or Poor SQL Statements: These SQL statements have a low degree of parallelism but consume massive amounts of CPU, I/O, or network resources, and are constantly committed.
- Traffic-intensive SQL Statements: A large number of normal SQL statements execute at the same time in the database, causing a resource bottleneck in the database and causing the slow response to SQL queries on data stored in key-value format.
- Others: This category includes all the other cases that are not covered by the preceding three types.
When the root cause of an instance exception falls into the second or third type of SQL statements described above, DAS automatically extracts keywords from problematic SQL statements to throttle these SQL statements. This operation requires user authorization. The biggest challenge is to extract the right keywords from SQL statements to implement precise throttling without unnecessarily interrupting other services.
3.6 Keyword Selection
If an abnormal SQL statement that needs to be throttled is detected, the next step is to extract keywords from this SQL statement. Ideally, the keywords should be unique, and only the abnormal SQL statements are throttled without affecting other SQL statements. DAS supports two throttling modes: SQL template-based throttling and SQL text-based throttling. Do not confuse them with each other.
- SQL Template-based Throttling: An SQL template contains text extracted from specific parameters of SQL statements. These SQL statements will certainly cause exceptions when a large number of them are executed in parallel, regardless of the specific parameter values. SQL template-based throttling is suitable for scenarios with traffic bursts and unindexed tables. In this case, DAS automatically throttles SQL statements that match keywords in the template.
- SQL Text-based Throttling: This mode is suitable for scenarios with skewed data, where some SQL statements that match the template are executed normally, but some are not. In this case, the keywords must contain not only the information in the template but also specific parameter values.
For SQL template-based throttling, if an SQL statement includes ID information contained in the template, the ID information has a higher priority. Such ID information includes the SQL ID automatically generated by the database middleware based on a template, and the hint information added to the SQL template by the developer.
The benefit of using an ID as a keyword is that it helps to ensure the template is unique and the SQL statements based on other templates are not affected. The drawback is that if a problematic SQL statement does not include ID information (for example, when it is manually executed from a command line), this SQL statement can still be executed when a throttling rule against the corresponding ID is enabled.
If an SQL statement does not include template ID information, text information needs to be extracted. Assume that we have obtained an SQL template during the analysis process. As shown in the following snippet, Template 1 and Template 2 can be respectively obtained through a calculation based on SQL text 1 and SQL text 2. To throttle SQL statements based on Template 1, the most comprehensive keywords that we can use are select, id, name, age, from, students, where, and name.
The benefit of using these keywords when throttling SQL statements is that all SQL statements that contain these keywords are throttled regardless of the connection mode that was used to send these SQL statements. The drawback is that some SQL statements may be mistakenly throttled. For example, Template 2 includes all keywords of Template 1.
/* SQL 文本1 */
select id,name,age from students where name='张三';
/* SQL 模板 1 */
select id,name,age from students where name = ?
/* SQL 文本2 */
select id,name,age from students where name='张三' and sid='唯一ID';
/*SQL 模板 2*/
select id,name,age from students where name=? and sid=?
When an abnormal SQL statement that can be optimized is identified through root cause analysis, the SQL statement is also sent to the auto-optimization module when throttling is enabled, for example, to automatically create an index in the corresponding table. This module will be described in detail later in another article.
3.8 Tracking and Rollback
Continuous tracking is performed after auto-throttling. If the traffic sent to the database is not reduced or is not reduced as much as expected, DAS automatically rolls back the throttling operation and starts another round of root cause analysis.
This feature has been running steadily for more than two years on Alibaba Group’s databases. It detects exceptions within a minute post their occurrence, identifies the root causes within 5 minutes, and recovers the service within 10 minutes, effectively reducing database faults.
This feature is now available in Alibaba Cloud DAS.
We will post a series of articles on a weekly basis to introduce Alibaba Cloud DAS features in detail, such as AutoScale, exception detection, auto-SQL optimization, workload-based SQL review, intelligent stress testing, and intelligent parameter tuning. Please stay tuned.
Visit Alibaba Cloud Database Autonomy Service (DAS) to learn more about its pricing and capabilities.