Solving Database Crashes Due to Service Exceptions — Auto-SQL Throttling with Alibaba Cloud DAS

1) Background

As the technology matures, especially with the popularization of cloud databases, database systems are becoming increasingly stable and easier to maintain. Many tasks such as version upgrades and instance migration are now completed automatically without affecting upper-layer services. Even if a hardware device or the network is abnormal, the preventive maintenance inspection (PMI) system immediately migrates data and restarts services to ensure service stability.

2) Issues

2.1 Traffic Issues — Sudden Traffic Bursts

There is a dramatic increase in the number of SQL statements executed in parallel, for example, in the case of cache penetration, abnormal calls, and major promotions.

2.2 Data Issues — SQL Statements on Skewed Data

For example, SQL statements querying for information about an account that involves a large number of transactions in the order data tend to slow down the database.

2.3 SQL Issues — Poor SQL Statements

If a large number of SQL statements are executed on new database tables that are not indexed, the system will slow down and normal services will be affected.

3) Implementation

3.1 Overall Process

The following diagram shows the overall implementation process:

3.2 Performance Metric Collection

This module is enabled by default for hosts and engines of ApsaraDB Relational Database Service (RDS) instances. These performance metrics include CPU usage, input/output operations per second (IOPS), queries per second (QPS), and active sessions. These real-time metrics serve as the basis for all subsequent analysis and actions.

3.3 Exception Detection

This module uses historical performance data of a database instance to train a machine learning model offline. Then it uses this model to detect exceptions based on real-time performance metrics. This method detects exceptions much sooner, before receiving the threshold-based alerts. This module will be described in detail later in another article.

3.4 Root Cause Analysis

This module subscribes to exception events of a database instance and collects session information the moment when an exception occurs. Then it analyzes the collected information together with statistics of all the SQL statements in SQL performance_schema (generated through SQL audit) to find the root cause of the exception.

  • Resource-consuming SQL Statements or Poor SQL Statements: These SQL statements have a low degree of parallelism but consume massive amounts of CPU, I/O, or network resources, and are constantly committed.
  • Traffic-intensive SQL Statements: A large number of normal SQL statements execute at the same time in the database, causing a resource bottleneck in the database and causing the slow response to SQL queries on data stored in key-value format.
  • Others: This category includes all the other cases that are not covered by the preceding three types.

3.5 Auto-throttling

When the root cause of an instance exception falls into the second or third type of SQL statements described above, DAS automatically extracts keywords from problematic SQL statements to throttle these SQL statements. This operation requires user authorization. The biggest challenge is to extract the right keywords from SQL statements to implement precise throttling without unnecessarily interrupting other services.

3.6 Keyword Selection

If an abnormal SQL statement that needs to be throttled is detected, the next step is to extract keywords from this SQL statement. Ideally, the keywords should be unique, and only the abnormal SQL statements are throttled without affecting other SQL statements. DAS supports two throttling modes: SQL template-based throttling and SQL text-based throttling. Do not confuse them with each other.

  • SQL Text-based Throttling: This mode is suitable for scenarios with skewed data, where some SQL statements that match the template are executed normally, but some are not. In this case, the keywords must contain not only the information in the template but also specific parameter values.
/* SQL 文本1 */
select id,name,age from students where name='张三';
/* SQL 模板 1 */
select id,name,age from students where name = ?
/* SQL 文本2 */
select id,name,age from students where name='张三' and sid='唯一ID';
/*SQL 模板 2*/
select id,name,age from students where name=? and sid=?

3.7 Auto-optimization

When an abnormal SQL statement that can be optimized is identified through root cause analysis, the SQL statement is also sent to the auto-optimization module when throttling is enabled, for example, to automatically create an index in the corresponding table. This module will be described in detail later in another article.

3.8 Tracking and Rollback

Continuous tracking is performed after auto-throttling. If the traffic sent to the database is not reduced or is not reduced as much as expected, DAS automatically rolls back the throttling operation and starts another round of root cause analysis.

4) Use

This feature has been running steadily for more than two years on Alibaba Group’s databases. It detects exceptions within a minute post their occurrence, identifies the root causes within 5 minutes, and recovers the service within 10 minutes, effectively reducing database faults.

Original Source:



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store