Optimal Solution to 100% CPU Usage of Databases

1) Introduction

This article introduces AutoScale, an innovative feature of Alibaba Cloud Database Autonomy Service (DAS). Based on the real-time performance data of database instances, DAS AutoScale detects abnormal traffic and gives suggestions on the appropriate database instance type and disk capacity. This service enables you to automatically scale up the storage and computing resources of your database. In this article, we focus on:

  • How can we quickly diagnose and deal with various unexpected database performance issues?
  • How can we meet business needs with minimal resource cost?

2) Background

Choosing an appropriate database instance type for your business application is challenging. An instance with unnecessarily high specifications wastes resources, while one with insufficient computing resources degrades your business.

3) Technical Challenges

When optimizing databases, we often scale computing nodes up or down. Although this operation only changes CPU and memory resources, it significantly impacts the production environment. It involves operations such as data migration, high availability (HA) switchover, and proxy switchover, and it ultimately affects the business.

Challenge 1 — How Can We Tell Whether the Scaling Operation Can Address the Problem?

In database management, 100% CPU usage indicates insufficient computing resources, but the root cause varies and so does the right remedy. For example, when business traffic surges and the current resources cannot meet the computing needs, auto-scaling is a good choice. However, a resource shortage can also be caused by a large number of slow SQL queries that congest the task queues and consume a large share of the computing resources. In this case, an experienced DBA would first consider SQL throttling instead of scaling. When DAS detects a resource shortage, it therefore also needs to identify the root cause and choose the proper response, for example, throttling or scaling.
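The throttle-or-scale decision described above can be sketched as a small rule-based function. This is only an illustration under stated assumptions, not the DAS decision logic: the `Snapshot` fields, the `decide_action` name, and every threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of one monitoring window (not a DAS data structure)."""
    cpu_usage: float           # average CPU utilization, 0.0 to 1.0
    qps: float                 # queries per second in this window
    baseline_qps: float        # typical QPS for a comparable period
    slow_session_ratio: float  # share of active sessions running slow SQL, 0.0 to 1.0


def decide_action(s: Snapshot,
                  cpu_threshold: float = 0.9,
                  surge_factor: float = 1.5,
                  slow_ratio_threshold: float = 0.3) -> str:
    """Return 'none', 'throttle', 'scale', or 'investigate'.

    All thresholds are illustrative assumptions.
    """
    if s.cpu_usage < cpu_threshold:
        return "none"
    # Slow SQL is congesting the sessions: prefer throttling over scaling.
    if s.slow_session_ratio >= slow_ratio_threshold:
        return "throttle"
    # Traffic is well above baseline and queries look healthy: add resources.
    if s.baseline_qps > 0 and s.qps >= surge_factor * s.baseline_qps:
        return "scale"
    # CPU is saturated but neither pattern matches: leave it to deeper diagnosis.
    return "investigate"
```

For instance, `decide_action(Snapshot(1.0, 120, 100, 0.6))` picks throttling because most active sessions run slow SQL, while a threefold traffic surge with few slow sessions leads to scaling.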

Challenge 2 — How Can We Choose the Appropriate Timing and Method for Scaling?

The timing of emergency scaling depends on how accurately emergencies are detected. If emergency alerts are sent too frequently, instances may be scaled up excessively, resulting in unnecessary costs. If an alert is sent too late, the emergency may affect your business or even cause business failures. In real-time monitoring, it is difficult to predict whether a sudden exception will persist from one moment to the next, so it is difficult to determine whether an emergency alert is required.
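One common way to balance alerts that fire too often against alerts that fire too late is an "m of n" rule with a cooldown: alert only when most recent samples breach a threshold, then stay quiet for a while. The sketch below illustrates that generic rule only. It is not the detection model DAS uses, and the class name and all parameters are assumptions.

```python
from collections import deque


class SustainedBreachDetector:
    """Alert when at least `min_breaches` of the last `window` samples breach."""

    def __init__(self, threshold=0.9, window=10, min_breaches=8, cooldown=30):
        self.threshold = threshold
        self.min_breaches = min_breaches
        self.cooldown = cooldown             # samples to stay quiet after an alert
        self.samples = deque(maxlen=window)  # True where a sample breached
        self.quiet_left = 0

    def observe(self, cpu_usage: float) -> bool:
        """Record one sample and return True if an alert should fire now."""
        self.samples.append(cpu_usage >= self.threshold)
        if self.quiet_left > 0:
            # Still in the cooldown period after the previous alert.
            self.quiet_left -= 1
            return False
        window_full = len(self.samples) == self.samples.maxlen
        if window_full and sum(self.samples) >= self.min_breaches:
            self.quiet_left = self.cooldown
            return True
        return False
```

With this rule, a two-sample spike never raises an alert, while a sustained breach raises one as soon as the window fills and then at most once per cooldown period. A larger `window` lowers the false-alert rate at the cost of a later alert, which is exactly the trade-off described above.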

Challenge 3 — How Can We Choose the Appropriate Instance Type?

In database management, changing an instance type involves multiple O&M operations. For example, to change the instance type of a physical database, you must perform many operations such as data file migration, cgroup isolation and redistribution, traffic proxy node switchover, and primary/secondary node switchover. Scaling a Docker-based database is even more complex because it involves additional microservice processes such as Docker image generation, ECS instance selection, and inventory management. Choosing an appropriate instance type reduces the number of instance changes and saves significant time for the business.
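To make the selection problem concrete, the sketch below picks the cheapest instance type that still leaves headroom above the observed peak load, so that one change is enough. It is a simplified illustration, not the DAS recommendation model: the `InstanceType` fields, the `recommend` function, and the 70% target utilization are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical catalog entry, not a real Alibaba Cloud specification."""
    name: str
    cpu_cores: int
    memory_gb: int
    hourly_cost: float


def recommend(catalog: Sequence[InstanceType],
              peak_cores_used: float,
              peak_memory_gb: float,
              target_utilization: float = 0.7) -> Optional[InstanceType]:
    """Return the cheapest type whose capacity keeps the peak at or below target."""
    # Divide by the target utilization to reserve headroom above the peak.
    need_cores = peak_cores_used / target_utilization
    need_memory = peak_memory_gb / target_utilization
    candidates = [t for t in catalog
                  if t.cpu_cores >= need_cores and t.memory_gb >= need_memory]
    # None signals that no type in the catalog is large enough.
    return min(candidates, key=lambda t: t.hourly_cost, default=None)
```

Sizing against the peak plus headroom, rather than against the current load, is what reduces repeated instance changes, at the price of a somewhat larger instance.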

4) Solutions

This section describes the DAS AutoScale feature from three perspectives: product capabilities, solutions to the preceding challenges, and core technologies. Specifically, DAS AutoScale provides auto-storage resizing and auto-instance scaling for two database services: RDS and PolarDB. Finally, a case study shows how to use the service.

4.1 Product Capabilities

In terms of product capabilities, DAS AutoScale currently provides the auto-storage resizing and auto-instance scaling services for Alibaba Cloud RDS and PolarDB databases.

4.2 Solution Description

To implement these product capabilities, DAS AutoScale provides a closed-loop data process, as shown in Figure 1:

Figure 1. Closed-loop data process of DAS AutoScale
  • The decision-making center makes global decisions based on information such as the current performance data and the instance session list to solve Challenge 1. For example, the decision-making center may enable SQL throttling to address the problem of insufficient computing resources or resume the AutoScale process in response to a sudden burst of business traffic.
  • As the core module of DAS AutoScale, the algorithm module provides models to detect abnormal service loads and recommend the appropriate storage capacity or instance type. It solves Challenges 2 and 3.
  • The instance type suggestion & verification module provides specific suggestions and adapts the deployment type of the database instance to the actual operating environment. It also verifies the deployment type against the available instance types in the current region, to ensure that the suggestions can be implemented by the control execution module.
  • The control execution module distributes and implements the instance type suggestions.
  • The scaling status-tracking module measures and tracks the performance changes in the database instances before and after instance scaling.
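Two of the steps above, verifying a suggestion against regional availability and tracking the effect of scaling, can be sketched as follows. This is a minimal illustration under stated assumptions: the size names in `SIZE_ORDER`, both function names, and the fallback rule (move up to the next larger available type) are hypothetical and not taken from DAS.

```python
from typing import List, Optional, Set

# Hypothetical instance sizes, ordered from smallest to largest.
SIZE_ORDER: List[str] = ["small", "medium", "large", "xlarge"]


def verify_suggestion(suggested: str, available_in_region: Set[str]) -> Optional[str]:
    """Return the suggested type if the region offers it, else the next larger one.

    Returns None when the suggestion is unknown or nothing large enough exists.
    """
    if suggested not in SIZE_ORDER:
        return None
    for name in SIZE_ORDER[SIZE_ORDER.index(suggested):]:
        if name in available_in_region:
            return name
    return None


def scaling_effect(cpu_before: List[float], cpu_after: List[float]) -> float:
    """Return the drop in mean CPU usage after scaling (positive means improvement).

    Both lists must be non-empty.
    """
    mean_before = sum(cpu_before) / len(cpu_before)
    mean_after = sum(cpu_after) / len(cpu_after)
    return mean_before - mean_after
```

Falling back only to larger types keeps the verified result at least as capable as the original suggestion, and the before/after comparison gives the tracking step a single number to report.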

Figure 2. Storage resizing solution

Figure 3. Instance scaling solution

4.3 Core Technologies

DAS AutoScale relies on the overall technical strength of the data link, management, and kernel teams for ApsaraDB. It mainly uses the following key technologies:

4.4 Case Study

Figure 4. Instance scaling task
