Why Cainiao’s Special and How Its Elastic Scheduling System Works
By Yang Yongzhou, nicknamed Chang Ling.
In this three-part article series, we’re going to take a deep dive into the inner workings of Alibaba’s big logistics affiliate, Cainiao, and take a look at the “hows” and “whys” behind its logistics data platform.
In particular, in this first part of the series, we’re going to see how Cainiao is able to use resources in an efficient and cost-effective manner thanks to its elastic scheduling system. Then, we will also have a brief discussion on what makes Cainiao’s systems special and how they differ from other logistics systems.
Continue reading to learn more about Cainiao: The Architecture behind Cainiao’s Elastic Scheduling System and The Architecture behind Cainiao’s Elastic Scheduling System (Continued).
How Elastic Scheduling Makes for Better Efficiency
Before we get into how elastic scheduling works at Cainiao, let’s look at why Cainiao’s overall resource usage has not always been cost-efficient.
The number of containers that a Cainiao application needs in order to run and provide services has traditionally been determined by combining single-server performance stress testing with experience-based estimates of service traffic. This method is heavily influenced by the subjective judgment of the estimator, who usually builds in a large amount of redundancy when estimating service traffic. In addition, scaling operations on Cainiao application clusters are seldom performed, so the estimator has to size the cluster for the daily peak traffic and for the service growth expected over an upcoming period, in particular a month-long one. Because peak traffic periods account for only a small portion of each day, a lot of resources are wasted during non-peak periods.
Moreover, judging by the observed behavior of application clusters that have elastic scheduling enabled, inaccurate capacity estimation is common, and the deviation from actual demand is large. Elastic scheduling addresses all of these problems because it dynamically evaluates the system’s operating status online and makes scaling decisions accordingly.
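To make the cost of static sizing concrete, here is a minimal Python sketch of the estimation method described above. Everything in it is an illustrative assumption rather than Cainiao’s actual tooling: the function names, the redundancy factor of 1.5, the per-container throughput of 100 QPS, and the hourly traffic profile are all made up for the example.

```python
import math


def static_container_count(peak_qps, per_container_qps, redundancy=1.5):
    """Containers needed to serve the estimated peak plus a safety margin.

    per_container_qps stands in for the result of a single-server stress
    test; redundancy stands in for the estimator's subjective margin.
    """
    return math.ceil(peak_qps * redundancy / per_container_qps)


def average_utilization(hourly_qps, containers, per_container_qps):
    """Mean fraction of the provisioned capacity that is actually used."""
    capacity = containers * per_container_qps
    return sum(qps / capacity for qps in hourly_qps) / len(hourly_qps)


# Hypothetical day: 22 quiet hours at 200 QPS and 2 peak hours at 1000 QPS.
hourly_qps = [200] * 22 + [1000] * 2

containers = static_container_count(peak_qps=1000, per_container_qps=100)
utilization = average_utilization(hourly_qps, containers, per_container_qps=100)

print(f"containers provisioned: {containers}")
print(f"average utilization: {utilization:.1%}")
```

With these assumed numbers, the cluster is sized for the peak plus margin, yet on average less than a fifth of the provisioned capacity is in use over the day.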
So, in other words, what everyone at Cainiao quickly realized was that elastic scheduling can help to make things work better and much more efficiently. Knowing this, Cainiao quickly created the relevant architecture, something we will get to in later articles in this series.
The elastic scheduling system of Ark at Cainiao works well because it enables application developers and O&M personnel to shift their focus from a specific objective, such as the number of containers, to a more abstract objective, such as a reasonable resource usage index and a reasonable service response time. This in turn reduces the impact of the subjective factors that are inevitable under manual estimation. Furthermore, thanks to the reliable and efficient scaling capability of Cainiao’s Ark platform, an application cluster can be scaled in just minutes, which allows for truly on-demand use of resources.
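The shift from “how many containers” to “what resource usage is reasonable” can be illustrated with a small Python sketch. It uses a simple proportional rule that is common in autoscalers in general; it is not the algorithm used by Ark, which this article does not describe. The function name, the CPU percentage metric, and the minimum and maximum bounds are assumptions for illustration.

```python
import math


def desired_containers(current, observed_cpu_pct, target_cpu_pct,
                       min_containers=2, max_containers=100):
    """Container count that would bring average CPU usage near the target.

    Proportional rule (an assumption, not Ark's policy):
        desired = ceil(current * observed / target)
    The result is clamped to [min_containers, max_containers].
    """
    if current <= 0 or target_cpu_pct <= 0:
        raise ValueError("current and target_cpu_pct must be positive")
    raw = math.ceil(current * observed_cpu_pct / target_cpu_pct)
    return max(min_containers, min(max_containers, raw))


# The owner only states the objective (keep CPU near 50%);
# the container count follows from the observed load.
print(desired_containers(current=10, observed_cpu_pct=80, target_cpu_pct=50))
print(desired_containers(current=10, observed_cpu_pct=20, target_cpu_pct=50))
```

The point of the sketch is the interface: the application owner supplies a target, and the number of containers becomes an output of the system instead of a manual estimate.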
Why Elastic Scheduling Works So Well at Cainiao
Now, although elastic scheduling can produce some great results for Cainiao, it is not necessarily suitable for all companies and organizations. So, why does elastic scheduling work so well for Cainiao anyway? Well, we think it comes down to these reasons:
First, the business of Cainiao requires that the system be designed to coordinate the transfer of information among several different merchants, several Cainiao partners, and customers. Moreover, the transfer of logistics order information involves a lengthy process and multiple interactions. This means that the flow of information is paced by a long sequence of actual operations. Therefore, there is only a slim chance that the elastic scheduling system of Ark will have to deal with traffic peaks like those generated by flash sales on shopping guide websites.
Next, after Cainiao fully implemented containerization and connected to a hybrid cloud architecture system in early 2017, it completed the transformation of resource management from “machine-oriented” to “application-oriented.” As a result, application deployment, scaling, and other core O&M processes were greatly simplified and improved. Ark, as a container resource control platform, proved its stability after being tested by major promotion campaigns such as Alibaba’s 6.18 Mid-year Shopping Festival and the Double 11 Shopping Festival in 2017. This laid a solid technical foundation for Cainiao to implement elastic scheduling.
Another reason is that most core applications of Cainiao are stateless online computing applications, and the gap between the service pressure peak and valley is significant every day. This provides sufficient scenarios for Cainiao to implement elastic scheduling.
The last reason is that elastic scheduling is not an independent project. It requires the assistance of several basic services and relies on a centralized and standardized system environment. The applications of Cainiao comply with the rules and standards stipulated by Alibaba Group. The elastic scheduling system can directly read monitoring and O&M data from tools such as Alimonitor, EagleEye, and Alimetrics, and the technology stacks used by the core applications have largely converged. This provides a sound environment for Cainiao to implement elastic scheduling.
Based on the preceding points, you can see why and how Cainiao was able to implement elastic scheduling relatively quickly and in a cost-efficient manner. Of course, businesses and organizations that meet similar conditions can also benefit greatly from elastic scheduling.
Why Cainiao Is Special
Now, let’s discuss why we think Cainiao is rather special, and how it’s different from other similar products and systems. Many teams at Alibaba Group have developed elastic scheduling products for certain business domains, and some public cloud service providers in the industry also provide elastic scaling services. So then, what are the differences between the elastic scheduling system of Cainiao and those of its counterparts in terms of challenges and product ideas?
Well, first of all, the elastic scheduling system of Cainiao is expected to cover all its own stateless core applications. These applications differ greatly in terms of service links, logic, resources, and traffic. Therefore, it is difficult to abstract a generic service model to describe these applications, and unlike elastic scheduling products for specific service domains, the elastic scheduling system of Cainiao cannot build in too many service assumptions during design. Rather, Cainiao must allow for enough generality when designing scheduling algorithms and policy models.
Meanwhile, at Cainiao at least, elastic scheduling needs to be dynamically customizable so that you can cope with different business scenarios. So, in terms of the design of the system structure, Cainiao needs to consider horizontal policy extension so that rapid extension can be performed in new special business scenarios.
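One common way to get this kind of horizontal policy extension is a registry of pluggable policies, sketched below in Python. This is only an illustration of the design idea, not the structure of Cainiao’s system: the registry, the policy names, the metric keys, the thresholds, and the rule of taking the largest proposal are all hypothetical.

```python
import math
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Policy = Callable[[int, Metrics], int]

# Hypothetical registry: new business scenarios add a policy here
# without touching the decision loop.
POLICIES: Dict[str, Policy] = {}


def register_policy(name: str):
    """Decorator that adds a scaling policy to the registry under a name."""
    def decorator(func: Policy) -> Policy:
        POLICIES[name] = func
        return func
    return decorator


@register_policy("cpu_target")
def cpu_target(current: int, metrics: Metrics) -> int:
    # Assumed target: keep average CPU usage near 50%.
    return math.ceil(current * metrics["cpu_pct"] / 50)


@register_policy("response_time")
def response_time(current: int, metrics: Metrics) -> int:
    # Assumed rule: add 20% capacity when response time exceeds 200 ms.
    if metrics["rt_ms"] > 200:
        return math.ceil(current * 120 / 100)
    return current


def decide(current: int, metrics: Metrics, enabled: List[str]) -> int:
    """Run every enabled policy and take the most conservative (largest) result."""
    return max(POLICIES[name](current, metrics) for name in enabled)


metrics = {"cpu_pct": 40, "rt_ms": 250}
print(decide(10, metrics, ["cpu_target", "response_time"]))
print(decide(10, metrics, ["cpu_target"]))
```

In this sketch, supporting a new special scenario means registering one more function and enabling it for the relevant application, while the generic decision loop stays unchanged.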
Then, there’s another important aspect to consider: the more generic a system is when dealing with complex scenarios, the more configuration options it requires. An elastic scheduling service running in the public cloud usually provides many configuration parameters. This is because such service providers leave it to you to absorb the complexity of the problem and to take responsibility for the stability and cost of using the service. The stability risk and cost-saving effect of this kind of elastic scheduling depend entirely on your own understanding of the technology.
By contrast, as the basic technology team of Cainiao, we define our role as a solver of problems with stability and costs rather than leaving more problems to you. We hope to not only provide all application owners of Cainiao with a new resource management function like elasticity in the public cloud, but also help them reduce costs and improve stability.
Therefore, from the beginning of product design, we hoped that the one-click access to the elastic scheduling system of Ark could be available to most application clusters and that the configuration problems of most applications could be solved before they were perceived. We also take the configuration of policy parameters as one of our core responsibilities. For applications with special requirements, we provide additional suggestions to help them connect to the elastic scheduling system of Ark with minimal configuration.