Optimal Solution to 100% CPU Usage of Databases
This article introduces AutoScale, an innovative feature of Alibaba Cloud Database Autonomy Service (DAS). Based on the real-time performance data of database instances, DAS AutoScale detects abnormal traffic and gives suggestions on the appropriate database instance type and disk capacity. This service enables you to automatically scale up the storage and computing resources of your database. In this article, we focus on:
- How can we optimize our database to maximize its performance?
- How can we quickly diagnose and deal with various unexpected database performance issues?
- How can we meet business needs with minimal resource cost?
Choosing an appropriate database instance type for your business application is challenging. An instance with unnecessarily high specifications wastes resources, while one with insufficient computing resources hurts your business.
Most database administrators (DBAs) choose an instance type (for example, 4 CPU cores and 8 GB of RAM) that lets services run smoothly at a moderate CPU usage (for example, less than 50%) with sufficient disk space (for example, 200 GB).
However, in real-life database operation and maintenance (O&M), database resources are often exhausted due to bursts of online application traffic, which may occur in the following scenarios:
1) Release of a new service: This happens when a new service is released but the service traffic is underestimated, which exhausts database resources. For example, a new application receives massive amounts of traffic, or a new feature is released on a platform with tremendous traffic.
2) Unexpected traffic surge: For example, a temporary traffic burst caused by trending topics, or flash sales triggered by internet celebrities.
3) Centralized access: Some services are accessed in a concentrated manner at specific points in time, for example, daily check-ins and check-outs, or weekly financial audits. The workload in these scenarios is generally low during off-peak hours. Therefore, most DBAs allocate limited resources to reduce costs, despite the known access peaks.
Unexpected shortage of computing resources in the preceding scenarios will seriously affect the business. Therefore, dealing with the exhaustion of database resources is one of the challenges that many DBAs have to face.
Database resources include computing and storage resources, and the exhaustion of database resources covers the following two aspects.
1) Computing resource exhaustion, also known as 100% CPU usage: This happens when there are insufficient computing resources for the current database instance, and it cannot handle incoming access requests.
2) Storage resource exhaustion, also known as 100% disk space usage: This happens when disk space for the instance is full, and no new data can be written to the database.
To solve these two problems, DAS enables storage and computing resource auto-scaling through service innovation.
As an innovative feature, DAS AutoScale detects abnormal traffic and gives suggestions on the appropriate database instance type and disk capacity based on the real-time performance data of database instances. This service enables you to automatically scale up the storage and the computing resources of your database.
Next, we describe the architecture of DAS AutoScale in detail, including the technical challenges, solutions, and key technologies.
3) Technical Challenges
When optimizing databases, we often scale computing nodes up or down. Although this operation only changes CPU and memory resources, it significantly impacts the production environment: it involves data migration, high availability (HA) switchover, and proxy switchover, all of which can ultimately affect the business.
When an instance is configured with insufficient computing resources, business traffic bursts can lead to 100% CPU usage. Generally, this problem can be solved by scaling up the database instance. A DBA will have to answer at least the following three questions when preparing the scaling plan:
1) Can the scaling operation address the problem of insufficient resources?
2) When should the scaling operation be performed?
3) Which instance type should be chosen?
To answer these three questions, DAS faces the following three challenges:
Challenge 1 — How Can We Tell Whether the Scaling Operation Can Address the Problem?
In database management, 100% CPU usage indicates insufficient computing resources, but the root causes vary, and the right solution depends on the cause. For example, when business traffic surges and the current resources cannot meet the computing needs, auto-scaling is a good choice. However, a resource shortage can also be caused by a large number of slow SQL queries that congest the task queues and consume a large share of computing resources. In this case, experienced DBAs would first think of SQL throttling instead of scaling. When DAS detects a resource shortage, it therefore also needs to identify the root cause and decide accordingly, for example, between throttling and scaling.
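The throttle-or-scale decision described above can be sketched as a small triage function. This is only an illustration: the `Snapshot` fields, the thresholds, and the `investigate` fallback are assumptions made for the example, not the actual logic of the DAS decision-making center.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One point-in-time view of an instance (all fields are hypothetical)."""
    cpu_usage: float         # 0.0 to 1.0
    qps: float               # current queries per second
    baseline_qps: float      # typical QPS for this time of day
    slow_query_share: float  # fraction of active sessions running slow SQL


def choose_action(s: Snapshot,
                  cpu_limit: float = 0.9,
                  surge_ratio: float = 2.0,
                  slow_share_limit: float = 0.5) -> str:
    """Return 'none', 'throttle', 'scale', or 'investigate'."""
    if s.cpu_usage < cpu_limit:
        return "none"
    # Slow SQL dominates the sessions: extra CPU rarely helps, so throttle first.
    if s.slow_query_share >= slow_share_limit:
        return "throttle"
    # Traffic is well above its baseline: the workload itself grew, so scale.
    if s.baseline_qps > 0 and s.qps / s.baseline_qps >= surge_ratio:
        return "scale"
    # High CPU with neither signal: leave the decision to a human.
    return "investigate"


print(choose_action(Snapshot(1.0, 500, 100, 0.1)))  # scale
```

Checking the slow-SQL signal before the traffic signal mirrors the point made above: when slow queries are the cause, scaling spends money without removing the bottleneck.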
Challenge 2 — How Can We Choose the Appropriate Timing and Method for Scaling?
The timing of emergency scaling depends on how accurately an emergency is identified. If emergency alerts are sent too frequently, instances may be scaled up excessively, resulting in unnecessary costs. If an emergency alert is sent too late, the emergency may affect your business or even cause business failures. In real-time monitoring, it is difficult to predict whether a sudden exception will persist from one moment to the next. Therefore, it is difficult to determine whether an emergency alert is required.
Usually, two scaling approaches are used: horizontal scaling (scaling out) by adding read-only nodes, and vertical scaling (scaling up) by adding resources to a single node.
Horizontal scaling is suitable for scenarios with more read traffic but less write traffic. However, when scaling out a conventional database, you need to migrate data to create read-only nodes, and data written to the primary node during the migration must be incrementally synchronized to the secondary nodes. This takes a relatively long time, which makes horizontal scaling too slow in some cases.
Vertical scaling is an upgrade of the existing instance. The general practice is to first upgrade the secondary database, switch services from the primary database to the secondary database, and then upgrade the primary database. This process reduces the impact on your business, but it also involves data synchronization and can cause data latency. Therefore, we must decide when and how to choose the appropriate scaling method based on the specific type of traffic on the current instance.
Challenge 3 — How Can We Choose the Appropriate Instance Type?
Changing an instance type involves multiple O&M operations. For example, to change the instance type of a physical database, you must perform operations such as data file migration, cgroup isolation and redistribution, traffic proxy node switchover, and primary/secondary node switchover. Scaling a Docker-based database is even more complex, because it involves additional microservice processes such as Docker image generation, ECS instance selection, and inventory management. Choosing an appropriate instance type can effectively reduce the number of instance changes and save significant time for the business.
When the CPU usage reaches 100%, you may face two situations after the scaling operation. One is that the CPU usage decreases and the business traffic becomes stable. The other is that the CPU usage is still 100% and the traffic increases with the improved computing capacity.
The first situation is the ideal state that we want to achieve through scaling. However, the second is also very common. In this case, the new instance still cannot meet the current business traffic requirement: resources are still insufficient, and the services are still affected. Choosing an appropriate instance type based on the operating information of the database therefore directly affects the effectiveness of the scaling operation.
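To make the sizing problem concrete, here is a minimal sketch that picks the smallest instance size expected to bring CPU usage down to a target level. The proportional model, the `suggest_cores` name, the 50% target, and the core-count catalog are all assumptions for illustration. The article states that DAS uses a trained model for this, which is not shown here.

```python
def suggest_cores(current_cores: int,
                  cpu_usage: float,
                  target_usage: float = 0.5,
                  catalog=(2, 4, 8, 16, 32, 64)) -> int:
    """Pick the smallest catalog size that should run at or below target_usage.

    Assumes CPU demand scales linearly with core count, which is a
    simplification. The catalog of core counts is hypothetical.
    """
    if not 0 < target_usage <= 1:
        raise ValueError("target_usage must be in (0, 1]")
    # Cores needed so that the same demand lands at the target usage.
    needed = current_cores * cpu_usage / target_usage
    for cores in sorted(catalog):
        # Never suggest a smaller instance in a scale-up decision.
        if cores >= needed and cores >= current_cores:
            return cores
    # Demand exceeds the largest size on offer.
    return max(catalog)


print(suggest_cores(4, 0.9))  # 8
```

Note the limitation that corresponds to the second situation above: once CPU usage is pinned at 100%, the measured usage understates the real demand, so the result of a proportional estimate like this one is only a lower bound.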
This section describes the DAS AutoScale feature from three perspectives: product capabilities, solutions to the preceding challenges, and core technologies. Specifically, DAS AutoScale provides auto-storage resizing and auto-instance scaling for two database services: RDS and PolarDB. Finally, a case study shows the service in action.
4.1 Product Capabilities
In terms of product capabilities, DAS AutoScale currently provides the auto-storage resizing and auto-instance scaling services for Alibaba Cloud RDS and PolarDB databases.
If the disk space of your purchased database instance is about to reach its upper limit, the auto-storage resizing service resizes the disk space in advance to avoid affecting your business. This service allows you to set the threshold for storage resizing. Alternatively, you can use the preset upper limit of 90% provided by DAS. When an auto-resizing event is triggered, DAS resizes the disk space for your instance.
The auto-instance scaling service allows you to adjust the computing resources of your database instance to meet your service requirements. It enables you to customize the maximum amount and duration of traffic peaks. You can also specify the maximum specification for scaling and whether to scale back during off-peak hours.
4.2 Solution Description
To implement these product capabilities, DAS AutoScale provides a closed-loop data process, as shown in Figure 1:
This process consists of the performance data collection module, decision-making center, algorithm module, instance type suggestion & verification module, control execution module, and scaling status-tracking module. These modules provide the following functions:
- The performance data collection module collects the performance data of instances in real-time, including information about the database performance metrics, configurations, and instance sessions.
- The decision-making center makes global decisions based on information such as the current performance data and the instance session list to solve Challenge 1. For example, the decision-making center may enable SQL throttling to address the problem of insufficient computing resources or resume the AutoScale process in response to a sudden burst of business traffic.
- As the core module of DAS AutoScale, the algorithm module provides models to detect abnormal service loads and recommend the appropriate storage capacity or instance type. It solves Challenges 2 and 3.
- The instance type suggestion & verification module provides specific suggestions and adapts the deployment type of the database instance to the actual operating environment. It also verifies the deployment type against the available instance types in the current region, to ensure that the suggestions can be implemented by the control execution module.
- The control execution module distributes and implements the instance type suggestions.
- The scaling status-tracking module measures and tracks the performance changes in the database instances before and after instance scaling.
The following text describes the storage resizing and instance scaling services provided by DAS AutoScale.
Figure 2 shows the storage resizing solution provided by DAS AutoScale. This solution can be triggered through user-defined trigger events or algorithm predictions. If the storage space is resized based on algorithm prediction, AutoScale uses a time series prediction algorithm to predict the future disk usage of the database instance based on historical disk usage for a specified period of time. If the disk usage is likely to exceed the disk space of your instance soon, the disk space is resized automatically. You can increase the disk space by at least 5 GB and no more than 15% of the original space each time, to provide sufficient disk space for your database instance.
Currently, the timing for auto-storage resizing is determined based on both threshold and prediction. When the disk data grows slowly and reaches the specified threshold (for example, 90%), storage resizing is triggered. If the disk data grows rapidly and is expected to deplete the storage space soon according to algorithm prediction, AutoScale will provide you with a storage resizing suggestion, and the reason for it.
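The combination of a threshold trigger and a prediction trigger can be sketched as follows. The article does not name the time series algorithm that AutoScale uses, so this example substitutes a plain least-squares linear trend as a stand-in. The function names, the sample format, and the 24-hour horizon are assumptions; the 90% threshold is the preset mentioned above.

```python
def growth_rate_gb_per_hour(samples):
    """Least-squares slope of (hour, used_gb) samples."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    return cov / var_t


def resize_reason(samples, capacity_gb, threshold=0.9, horizon_hours=24.0):
    """Return 'threshold', 'prediction', or None if no resize is needed."""
    used = samples[-1][1]
    # Slow growth: act once usage crosses the configured threshold.
    if used / capacity_gb >= threshold:
        return "threshold"
    # Fast growth: act if the trend fills the disk within the horizon.
    rate = growth_rate_gb_per_hour(samples)
    if rate > 0 and (capacity_gb - used) / rate <= horizon_hours:
        return "prediction"
    return None


fast = [(0, 100.0), (1, 110.0), (2, 120.0)]  # growing 10 GB per hour
print(resize_reason(fast, capacity_gb=200))  # prediction
```

In the example, the disk is only 60% full, so the threshold alone would not fire, yet at 10 GB per hour the remaining 80 GB would be gone in 8 hours. This is the case the prediction path is meant to catch.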
Figure 3 shows the instance scaling solution provided by DAS AutoScale. First, the exception detection module detects a traffic burst from multiple dimensions (such as QPS, TPS, active sessions, and IOPS). Then, the decision-making center decides whether to implement auto-scaling. Next, the instance type suggestion & verification module generates a scaling suggestion, and the control execution module implements the scaling suggestion.
After the abnormal traffic ends, the exception detection module identifies whether the traffic has returned to normal. Then, the control execution module scales the instance back to the original instance type recorded in the metadata. After the instance scaling process is completed, the scaling status-tracking module tracks the performance trend during the scaling period and evaluates the results.
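Detecting a burst from multiple dimensions can be illustrated with a simple z-score check per metric. This is a sketch under stated assumptions, not the DAS detection model: the z-score rule, the limit of 3, and the requirement that at least two metrics agree are choices made for the example.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, z_limit=3.0):
    """Flag a value that sits far above its recent history."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any increase counts as a deviation.
        return current > mu
    return (current - mu) / sigma > z_limit


def detect_burst(history_by_metric, current_by_metric, min_metrics=2):
    """Report a burst only when several metrics spike at the same time."""
    flagged = [name for name, hist in history_by_metric.items()
               if is_anomalous(hist, current_by_metric[name])]
    return len(flagged) >= min_metrics, flagged


history = {
    "qps": [100, 110, 90, 105, 95],
    "tps": [20, 22, 18, 21, 19],
    "active_sessions": [5, 6, 4, 5, 5],
    "iops": [300, 310, 290, 305, 295],
}
current = {"qps": 400, "tps": 80, "active_sessions": 5, "iops": 300}
print(detect_burst(history, current))  # (True, ['qps', 'tps'])
```

Requiring agreement between metrics reduces false alarms from a single noisy series, which relates to the concern about overly frequent alerts raised in Challenge 2.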
Currently, AutoScale is triggered based on the duration of the user-defined observation window and the exception information detected based on various performance metrics of the instance, such as the CPU usage, disk IOPS, and logical reads.
After AutoScale is triggered, the algorithm module uses a well-trained model to provide a suggestion on the instance type suitable for the current traffic, based on the current performance data, instance type, and historical performance data. The scale-back timing is also determined based on the duration of the user-defined silence window and the performance data of the instance. The scale-back operation is triggered when the specified conditions are met.
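The interplay of the observation window (for scaling up) and the silence window (for scaling back) can be sketched as a small state machine. The class name, the per-minute sampling, and the default values are assumptions for illustration. Reusing one CPU threshold for both directions is also a simplification, and a real policy would likely use a lower threshold for scaling back to avoid flapping.

```python
class ScaleController:
    """Toy controller: scale up after a sustained hot period,
    scale back after a sustained calm period."""

    def __init__(self, cpu_limit=0.7, observation_min=15, silence_min=30):
        self.cpu_limit = cpu_limit
        self.observation_min = observation_min  # hot minutes before scale-up
        self.silence_min = silence_min          # calm minutes before scale-back
        self.scaled = False
        self._streak_start = None  # minute at which the current streak began

    def observe(self, minute, cpu_usage):
        """Feed one sample; return 'scale_up', 'scale_back', or None."""
        hot = cpu_usage > self.cpu_limit
        # Before scaling we count hot minutes, after scaling calm minutes.
        in_streak = hot != self.scaled
        if not in_streak:
            self._streak_start = None
            return None
        if self._streak_start is None:
            self._streak_start = minute
        window = self.silence_min if self.scaled else self.observation_min
        if minute - self._streak_start < window:
            return None
        # The streak lasted for the whole window: flip the state.
        self.scaled = not self.scaled
        self._streak_start = None
        return "scale_up" if self.scaled else "scale_back"


ctl = ScaleController(cpu_limit=0.7, observation_min=15, silence_min=10)
samples = [(m, 0.1) for m in range(0, 5)]     # quiet
samples += [(m, 0.9) for m in range(5, 26)]   # sustained burst
samples += [(m, 0.4) for m in range(26, 41)]  # traffic back to normal
events = [(m, e) for m, c in samples if (e := ctl.observe(m, c))]
print(events)  # [(20, 'scale_up'), (36, 'scale_back')]
```

Any sample that breaks the streak resets the timer, so a short spike never triggers a scaling operation.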
4.3 Core Technologies
DAS AutoScale relies on the overall technical strength of the data link, management, and kernel teams for ApsaraDB. It mainly uses the following key technologies:
1) Second-level database monitoring: Currently, the monitoring and collection process enables second-level data collection, monitoring, display, and diagnosis of all database instances deployed in Alibaba Cloud. DAS AutoScale is able to process more than 10 million monitoring metrics per second in real time, laying a solid data foundation for intelligent database services.
2) Unified RDS control task flow: Currently, the task flow is used for O&M of all database instances deployed in Alibaba Cloud, providing strong support for the implementation of AutoScale.
3) Time series exception detection algorithm based on prediction and machine learning: The algorithm provides many functions such as periodic detection, turning point determination, and continuous exception interval identification. Currently, DAS is able to predict the data traffic of more than 100,000 online database instances one day in advance, with a prediction error rate of less than 5% for more than 99% of instances. It is also able to predict data traffic 14 days in advance with an error rate of less than 5% for more than 94% of instances.
4) Deep learning-based database response time (RT) prediction model: This algorithm can predict the RT value of a running database instance based on multiple metrics, such as the CPU usage, logical reads, physical reads, and IOPS of the instance. The RT value provides guidance for memory scale-down of the buffer pool, saving more than 27 TB of memory space (about 17% of the total memory space) for Alibaba’s databases.
5) PolarDB based on cloud computing architecture: PolarDB is a next-generation relational database developed by the Alibaba Cloud database team for the cloud computing era. Its computing nodes are separated from its storage nodes, which provides powerful technical support for AutoScale. This separation avoids the additional overhead of copying stored data during scaling and greatly improves the AutoScale experience.
With these technologies, DAS AutoScale provides suggestions on instance type selection and storage resizing for RDS and PolarDB databases. It ensures data consistency and integrity during auto-scaling, without affecting business stability.
4.4 Case Study
Figure 4 shows the auto-scaling process of an instance. A business system suddenly experienced abnormal traffic at 19:43, resulting in an abrupt surge in CPU usage and active sessions. The CPU usage increased from about 10% to more than 70%, leading to a shortage of CPU resources.
In this instance, a 15-minute observation window and a trigger condition of the CPU usage exceeding 70% were configured to avoid frequent triggering of AutoScale operations. The abnormal traffic lasted until 19:58, meeting the trigger condition. A control task was completed within 7 minutes, finishing the scaling of the primary node at 20:05. The comparison of the resource usage before and after the scale-up shows that the CPU usage and IOPS of the instance were relatively high before the scaling operation, and decreased significantly afterward.
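The timeline in this case study follows directly from the configured window and the task duration. The short sketch below reproduces the arithmetic; the `scaling_timeline` helper is hypothetical, and the 7 minutes are treated as the full task duration because the article reports completion at 20:05.

```python
from datetime import datetime, timedelta


def scaling_timeline(surge_start, observation_min, task_min):
    """Return (trigger time, completion time) as HH:MM strings."""
    start = datetime.strptime(surge_start, "%H:%M")
    # The trigger fires once the anomaly has lasted the whole window.
    triggered = start + timedelta(minutes=observation_min)
    # The control task then runs to completion.
    finished = triggered + timedelta(minutes=task_min)
    return triggered.strftime("%H:%M"), finished.strftime("%H:%M")


print(scaling_timeline("19:43", observation_min=15, task_min=7))
# ('19:58', '20:05')
```

In total, about 22 minutes passed between the start of the surge and the completion of the scale-up, of which 15 minutes were the deliberately configured observation window.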
AutoScale is free on Alibaba Cloud Database Autonomy Service (DAS), so don’t forget to give it a try today!