Solving Database Crashes Due to Service Exceptions — Auto-SQL Throttling with Alibaba Cloud DAS

By Yunji, senior database R&D engineer from Alibaba Cloud ApsaraDB

In this article, we discuss how Alibaba Cloud Database Autonomy Service (DAS) provides auto-SQL throttling to handle service exceptions and database crashes effectively with quick turnaround time.

1) Background

However, almost all the existing automated solutions focus on ensuring server-end stability. Issues caused by service-end exceptions, such as slow SQL statements caused by service changes and traffic bursts, still require human intervention. These solutions are unable to rapidly handle service exceptions and prevent systems from crashing.

2) Issues

2.1 Traffic Issues — Sudden Traffic Bursts

2.2 Data Issues — SQL Statements on Skewed Data

2.3 SQL Issues — Poor SQL Statements

Conventional solutions such as killing problematic SQL statements, restarting the database, or switching to a secondary database cannot solve these issues.

Database administrators (DBAs) are not usually able to do anything other than watch the database crash or call developers for help.

If you are familiar with the Alibaba Cloud products, you probably know that ApsaraDB for RDS for MySQL 5.6 and later versions effectively solves these issues through SQL throttling.

SQL throttling specifies the maximum degree of parallelism for executing any SQL statement in the database. By throttling the degree of parallelism for abnormal SQL statements, the database responds to service requests effectively while ensuring most of the services operate normally. In other words, the normal operation of the majority of services is guaranteed by compromising a minority of them.

This feature has been put into use for nearly three years. Many users have the following questions while using it:

1) How do I detect an exception immediately when it occurs?
2) How do I identify SQL statements that need to be throttled when an exception occurs?
3) How do I extract keywords for SQL statements that need to be throttled to minimize the impact on services when facilitating service recovery?
4) How do I quickly verify whether a throttling operation has been executed properly?

In addition to these questions, you may face some other problems. For example, when an emergency occurs, your DBA cannot be reached or does not have immediate access to a computer; there is too much information to analyze, or operation errors are introduced during troubleshooting.

Therefore, you need to automate the whole process including detecting exceptions, locating abnormal SQL statements, throttling SQL statements, and tracking and rolling back operations.

The automated SQL throttling solution was then developed and has been running on Alibaba Group’s internal databases for more than two years. It was officially made available to the public as part of an Alibaba Cloud product in February 2020. To try this feature, go to the Alibaba Cloud Database Autonomy Service (DAS) product page.

3) Implementation

3.1 Overall Process

3.2 Performance Metric Collection

3.3 Exception Detection

3.4 Root Cause Analysis

We categorize common root causes into four types:

  • Blocking SQL Statements: DAS analyzes database instance information, such as real-time sessions, lock waits, and ongoing transactions, to determine whether DDL changes, large transactions, or lock waits exist. Then DAS calculates the number of affected sessions and the execution time of these sessions. If the exception affects a relatively large number of sessions, or the affected sessions require a rather long time to execute, DAS kills abnormal sessions without throttling.
  • Resource-consuming SQL Statements or Poor SQL Statements: These SQL statements have a low degree of parallelism but consume massive amounts of CPU, I/O, or network resources, and are constantly committed.
  • Traffic-intensive SQL Statements: A large number of normal SQL statements execute at the same time in the database, causing a resource bottleneck in the database and causing the slow response to SQL queries on data stored in key-value format.
  • Others: This category includes all the other cases that are not covered by the preceding three types.

3.5 Auto-throttling

3.6 Keyword Selection

  • SQL Template-based Throttling: An SQL template contains text extracted from specific parameters of SQL statements. These SQL statements will certainly cause exceptions when a large number of them are executed in parallel, regardless of the specific parameter values. SQL template-based throttling is suitable for scenarios with traffic bursts and unindexed tables. In this case, DAS automatically throttles SQL statements that match keywords in the template.
  • SQL Text-based Throttling: This mode is suitable for scenarios with skewed data, where some SQL statements that match the template are executed normally, but some are not. In this case, the keywords must contain not only the information in the template but also specific parameter values.

For SQL template-based throttling, if an SQL statement includes ID information contained in the template, the ID information has a higher priority. Such ID information includes the SQL ID automatically generated by the database middleware based on a template, and the hint information added to the SQL template by the developer.

The benefit of using an ID as a keyword is that it helps to ensure the template is unique and the SQL statements based on other templates are not affected. The drawback is that if a problematic SQL statement does not include ID information (for example, when it is manually executed from a command line), this SQL statement can still be executed when a throttling rule against the corresponding ID is enabled.

If an SQL statement does not include template ID information, text information needs to be extracted. Assume that we have obtained an SQL template during the analysis process. As shown in the following snippet, Template 1 and Template 2 can be respectively obtained through a calculation based on SQL text 1 and SQL text 2. To throttle SQL statements based on Template 1, the most comprehensive keywords that we can use are select, id, name, age, from, students, where, and name.

The benefit of using these keywords when throttling SQL statements is that all SQL statements that contain these keywords are throttled regardless of the connection mode that was used to send these SQL statements. The drawback is that some SQL statements may be mistakenly throttled. For example, Template 2 includes all keywords of Template 1.

/* SQL 文本1 */
select id,name,age from students where name='张三';
/* SQL 模板 1 */
select id,name,age from students where name = ?
/* SQL 文本2 */
select id,name,age from students where name='张三' and sid='唯一ID';
/*SQL 模板 2*/
select id,name,age from students where name=? and sid=?

3.7 Auto-optimization

3.8 Tracking and Rollback

4) Use

This feature is now available in Alibaba Cloud DAS.

We will post a series of articles on a weekly basis to introduce Alibaba Cloud DAS features in detail, such as AutoScale, exception detection, auto-SQL optimization, workload-based SQL review, intelligent stress testing, and intelligent parameter tuning. Please stay tuned.

Visit Alibaba Cloud Database Autonomy Service (DAS) to learn more about its pricing and capabilities.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.