The Architecture behind Cainiao’s Elastic Scheduling System (Continued)

The Policies of Cainiao’s Ark Architecture

To continue our discussion of the Ark architecture behind Cainiao’s elastic scheduling system, this article first looks at some of the policies involved in this architecture.

Decision-making Policies

The elastic scheduling system of Ark supports the quick horizontal extension of decision-making policies. Currently, multiple decision-making policies are included, and some policies are being tested or verified. The following introduces several of the earliest core policies that went online.

Resource Security Policies

Resource security policies focus on the usage of system resources. Based on past O&M experience and Cainiao’s business dynamics, our resource security policies focus on three system parameters: CPU, LOAD1, and Process Running Queue.

Resource Optimization Policies

Resource optimization policies also focus on the usage of system resources, but they are intended to reclaim resources when the system is idle. They monitor the same three system parameters as resource security policies. When all three parameters are below their lower thresholds, a scale-in request is initiated. Note that because of the second layer of decision-making, scale-in requests generated by resource optimization policies are suppressed whenever other decision-making policies require scale-out.
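The scale-in rule and its suppression can be sketched as follows. This is only a minimal illustration, not Ark’s implementation: the `Metrics` type, the function names, and the threshold values are hypothetical, since the article does not give the actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu: float        # CPU utilization, in percent
    load1: float      # 1-minute load average (LOAD1)
    run_queue: float  # length of the process running queue


# Hypothetical lower thresholds, chosen only for illustration.
LOWER = Metrics(cpu=20.0, load1=1.0, run_queue=1.0)


def wants_scale_in(m: Metrics, lower: Metrics = LOWER) -> bool:
    """Request scale-in only when all three parameters are below the lower threshold."""
    return (
        m.cpu < lower.cpu
        and m.load1 < lower.load1
        and m.run_queue < lower.run_queue
    )


def final_decision(m: Metrics, other_policies_want_scale_out: bool) -> str:
    """Second decision layer: any scale-out request suppresses scale-in."""
    if other_policies_want_scale_out:
        return "scale_out"
    return "scale_in" if wants_scale_in(m) else "hold"
```

For example, `final_decision(Metrics(5.0, 0.2, 0.0), other_policies_want_scale_out=True)` returns `"scale_out"` even though the cluster looks idle, which mirrors the suppression described above.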

Time Policies

The current elastic decision-making model is posterior in nature. That is, a scale-out request is initiated only when a threshold violation occurs. For some services, such as regular computing tasks, traffic may surge on a regular basis.

Service Security Policies

Service security policies are the most complex among all policies. Currently, the covered services include the Consumer in the message queue, the Remote Procedure Call (RPC) service, and the Hypertext Transfer Protocol (HTTP) service. At least half of the scale-out tasks each day are initiated by service security policies.

Selection between Queries Per Second and Response Time

Next, there’s the important question of choosing between queries per second (QPS) and response time (RT). Given the relative complexity of this topic, we’ve decided to give it its own section. Many elastic scheduling systems consider QPS the most important factor.

  • Many core applications go across service links and therefore provide various services.
  • Currently, the effect of container isolation is not perfect.
  • Services may change at any time.

Threshold Comparison and Multi-service Voting

Automatic threshold configuration for RT of massive services was introduced in a preceding example and will not be described further in this section.

Downstream Analysis

Scale-out is not conducted for all RT threshold violations. If the RT rises because of the middleware or downstream services on which a service depends, scale-out cannot solve the problem and may even exacerbate it, for example, when the database thread pool is full. Therefore, when a service security policy finds that the RT of a service violates its own threshold, the policy first checks whether the RT of the middleware and the RT of the downstream services on which this service depends violate their respective SLAs. The service participates in the voting on scale-out only when it violates its own threshold while the corresponding middleware and downstream services do not violate theirs. When an offline computing task calculates the SLA of a service based on historical data, it also calculates the thresholds of the middleware and downstream services on which this service depends. The link structure used in this calculation is derived from the offline data of EagleEye.
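The eligibility check described above can be sketched roughly as follows. The `RtSample` and `Service` types and the function name are hypothetical, and the thresholds are assumed to have been computed offline already; this is an illustration of the rule, not Ark’s actual code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RtSample:
    name: str
    rt_ms: float         # observed response time, in milliseconds
    threshold_ms: float  # threshold (SLA) derived offline from historical data

    def violated(self) -> bool:
        return self.rt_ms > self.threshold_ms


@dataclass
class Service:
    own: RtSample
    # Middleware and downstream services this service depends on.
    dependencies: List[RtSample] = field(default_factory=list)


def votes_for_scale_out(service: Service) -> bool:
    """A service votes only if its own RT is violated and no dependency's RT is."""
    if not service.own.violated():
        return False
    return not any(dep.violated() for dep in service.dependencies)
```

With a slow database (`RtSample("db", 80, 50)`) among its dependencies, a service whose own RT exceeds its threshold does not vote, because adding containers would only put more pressure on the database.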

Other Miscellaneous Topics

The Way in Which the Elastic Scheduling System Handles Glitches

An elastic scheduling system needs to ensure that scaling is timely enough, while also having enough tolerance for glitches. Otherwise, incorrect scale-out or scale-in may occur. Glitches are common in the actual environment. Problems with the environment, such as uneven traffic and jitters, can result in glitches, and potential anomalies and bugs in the monitoring and statistical system may also cause glitches.

How the Elastic Scheduling System Calculates the Number of Containers

Determining the number of containers to be scaled out or in is the second step after a scaling decision is made. Let’s first take a look at scale-in. Scale-in is faster than scale-out and does not affect stability. Therefore, the elastic scheduling system of Ark scales in containers steadily and promptly. When a scale-in decision is made, the number of containers to be scaled in is fixed at 10% of the current number of containers in the application cluster.
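As a rough sketch of the 10% rule, the step size could be computed as shown below. The rounding behavior, the minimum step of one container, and the `min_containers` floor are assumptions made for this example; the article only states the 10% ratio.

```python
def containers_to_scale_in(
    current: int,
    ratio_percent: int = 10,   # the 10% ratio stated in the article
    min_containers: int = 1,   # hypothetical floor so a cluster is never emptied
) -> int:
    """Return how many containers to remove in one scale-in step."""
    # Integer arithmetic rounds down; remove at least one container per step (assumption).
    step = max(1, current * ratio_percent // 100)
    # Never go below the assumed minimum cluster size.
    return max(0, min(step, current - min_containers))
```

Under these assumptions, a cluster of 50 containers shrinks by 5 per decision, a cluster of 5 shrinks by 1, and a single-container cluster is left untouched.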

How the Elastic Scheduling System Makes the Capacity of Links Elastic

The capacity of a link is considered to be elastic when all the application clusters along this link have been connected to the elastic scheduling system.

How the Elastic Scheduling System Copes with Traffic Surges

Ark provides a complete suite of solutions for resource management by combining its container scheme and elastic scheduling. The container scheme allows all application owners to request capacity during the specified promotional periods involving stress testing and big promotions. When application clusters connected to the elastic scheduling system of Ark enter the specified promotional periods, the scaling mode of this system will be immediately switched to the manual verification mode. Moreover, the number of containers in each application cluster will be scaled out to the number of containers that you requested through the container scheme. When the promotional periods end, the number of containers in an application cluster not connected to the elastic scheduling system will be directly scaled in to its original number of containers, whereas an application cluster connected to the elastic scheduling system will slowly reclaim resources through an elastic scheduling policy.
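The behavior around a promotional period can be sketched as a small state change per cluster. Everything named here (`Cluster`, `enter_promotion`, `exit_promotion`, the mode strings) is hypothetical, and switching back to automatic mode after the period is an assumption; the article only says that connected clusters reclaim resources slowly through an elastic scheduling policy.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    containers: int
    elastic: bool            # connected to the elastic scheduling system?
    mode: str = "automatic"  # scaling mode; only meaningful for elastic clusters
    baseline: int = 0        # container count before the promotional period


def enter_promotion(cluster: Cluster, requested: int) -> None:
    """Scale out to the requested capacity and stop automatic scaling."""
    cluster.baseline = cluster.containers
    cluster.containers = requested
    if cluster.elastic:
        cluster.mode = "manual_verification"


def exit_promotion(cluster: Cluster) -> None:
    """Non-elastic clusters snap back; elastic clusters reclaim gradually."""
    if cluster.elastic:
        # Assumption: automatic scaling resumes, and later scale-in
        # decisions shrink the cluster step by step.
        cluster.mode = "automatic"
    else:
        cluster.containers = cluster.baseline
```

The asymmetry in `exit_promotion` is the point of the sketch: a connected cluster keeps its promotional capacity at first and gives it back only as scale-in decisions are made.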

How Elastic Scheduling Propels Service Evolution

Using elastic scheduling to push forward the evolution of services has been our long-standing goal. For us, only this can enable a truly closed loop of services. This impetus is based on data analysis. Currently, the elastic scheduling system of Ark produces the following types of data:

  • Data on the correlation between RT and resource usage.
  • Data on middleware stability.
  • Data on downstream stability.
  • Data on the startup timeliness of applications.
  • Data on the rationality of throttling configuration.
