By Jiang Jiangwei (Xiaoxie), VP of Alibaba Group and Head of Smart Infrastructure Product Division.
It is now a common practice for enterprises to digitize and migrate their businesses to the cloud. 5G, industrial internet, artificial intelligence, and cloud computing are the primary constituents of the digital economy and will become the major components of China’s new infrastructure. The cloud will bring revolutionary changes in research and development (R&D) and operations and maintenance (O&M) across IT departments. This article is based on the video of Jiang Jiangwei’s speech, where he introduces Alibaba’s internet-oriented and cloud-oriented R&D model.
1) Cloud Migration Basics
E-commerce and cloud architectures have evolved along the same lines, toward two goals. First, from the R&D perspective, the goal is to make developing a system as simple as working with a single server. For example, when adding a server for the customer access, transaction, commodity, or user systems, we want to simply configure an IP address, launch the server, and obtain its computing capabilities. Second, from the O&M perspective, we want to manage hundreds of thousands of servers as if they were a single server. From both perspectives, then, we hope that the continuous evolution of R&D will let us use and manage many servers as one.
Now with cloud migration as an inevitable trend, developers have three basic concerns.
- Elasticity: Instant scaling up of resources as per the developers’ needs. The application system architecture is very complex and involves hundreds of systems. Therefore, developers want to quickly scale up or down the resources as and when an application system requires elasticity.
- Stability: The approach to stability has gone through several distinct stages. Over the past 20 years, stability was first driven by hardware. For example, some minicomputers ensure the stability of computing resources through hardware redundancy and disaster recovery. With the advent of the internet era, architecture was increasingly used to enhance system stability. However, many companies could not ensure stability through architecture and had to purchase additional physical resources instead. Now the cloud has evolved to provide high resource stability even on inexpensive PC servers.
- Manageability: Companies provide a variety of application systems to serve more customers on the cloud. They hope for convenient management of cloud applications and systems similar to app management on mobile phones. Meanwhile, they want simplified management and O&M for the applications and O&M systems developed by different companies.
2) Evolution of Resource Usage Methods
1) Pain Points in Alibaba’s Early Resource Management Model
- Relatively Low Resource and Labor Efficiency
Resource budgets were often evaluated inaccurately. Especially when businesses were growing quickly, some would suddenly exceed their estimated budgets, for example because of growth in the advertising business, and additional middleware resources would be needed. When a budget was accurate, O&M personnel or resource supply staff could provide resources according to it. However, budget cycles were measured in weeks, which was too coarse a granularity. When the budget proved insufficient, the additional resource needs had to be evaluated and approved at each level of the company, and the purpose of the resources had to be explained. This created a great deal of difficulty for engineers.
Inaccurate resource forecasts were the norm, and they created problems that engineers had to watch for when purchasing and using resources. In fact, resource utilization measured across departments or applications was very low. For example, message-oriented middleware would run short of resources when a service suddenly sent more messages, while other systems had plenty of spare resources at the same moment. The overall system load was therefore very low, yet it was difficult to share resources across systems. With no centralized resource scheduling capabilities, business departments could only manage and use their own resources, so the overall resource utilization of the online system was low. Moreover, because recycling and borrowing resources was difficult, departments were unwilling to release resources to each other.
2) Architecture Optimization Improves Resource Usage Efficiency
- Full-site Containerization and Centralized Scheduling
Full containerization is a prerequisite for centralized resource scheduling. We wanted different applications, systems, middleware, and caches to run in containers so that centralized resource scheduling would be possible. Once resource pools are merged, all resources are allocated and scheduled through the merged pool. Full containerization and resource pool merging allowed centralized scheduling of global resources. This solved the problem of resource sharing among different business units and improved resource and inventory utilization. Different business units, such as transaction, search, and cross-industry businesses, have different peak values and peak times, so merging the resource pools allowed us to provide ample resources to each of them. Centralized scheduling is also a more effective way to raise utilization than having each business unit try to make full use of its own resources by optimizing systems, machines, and code.
- Standardized O&M
Centralized resource scheduling reduces O&M costs and improves delivery efficiency and operational stability. Given that the resource scheduling methods are the same, investment in the R&D or O&M of centralized resource scheduling not only makes scheduling products more convenient and commercial but also reduces the need to invest in human resources. Containerization and centralized resource scheduling have gradually become dominant trends in industry evolution.
The hybrid resource pool raised our resource utilization rate to 40% because different systems have different peak values and times. For example, peak traffic during Alibaba’s Double 11 event is extremely high, up to 30 times the peak on normal days. Now, thanks to Alibaba Cloud, it is much easier to cope with the Double 11 traffic peak. In the past, only one-thirtieth of the resources prepared for Double 11 would be used on normal days, resulting in a severe waste of resources.
From the perspective of technical evolution, online and offline workloads can be deployed in a hybrid manner. Offline workloads generally run offline computing on data that online services have generated. The peak usage times of online and offline services differ, and offline services can tolerate slight delays in report generation but not a loss of data accuracy. Online businesses, however, cannot afford any delay or loss of accuracy. For example, real-time feedback is required when a user places an order; you cannot ask the user to wait five minutes to see whether the order was placed successfully. Hybrid deployment is therefore the direction of the technical trend: offline services are deployed on online resources, and online services are deployed on offline resources. With full containerization and centralized resource scheduling, online services can be scheduled onto offline resources to cope with the high traffic peaks of Double 11, so that offline resources supplement online resources. Using offline resources to assist online services greatly reduces the resources that must be held day to day. On normal days, offline workloads can in turn run on online resources to obtain excellent performance.
After hybrid deployment, our resource usage efficiency increased from over 10% to 40%, effectively reducing required hardware investment.
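The effect of merging pools whose peaks fall at different times can be illustrated with a small calculation. The sketch below is a toy model, not Alibaba's scheduler: the two hourly demand series and every number in them are invented for illustration. It compares provisioning each business unit for its own peak against provisioning one shared pool for the peak of the combined demand.

```python
# Hypothetical demand (in servers) for two workloads over eight time slots.
# The values are made up; only the staggered peaks matter.
online = [20, 20, 30, 80, 100, 60, 30, 20]    # peaks in the middle of the day
offline = [90, 100, 70, 20, 10, 30, 80, 90]   # peaks overnight


def utilization(demand, capacity):
    """Average fraction of the provisioned capacity that is actually used."""
    return sum(demand) / (capacity * len(demand))


combined = [a + b for a, b in zip(online, offline)]

# Separate pools: each workload must be provisioned for its own peak.
separate_capacity = max(online) + max(offline)
separate_util = utilization(combined, separate_capacity)

# Merged pool: provision once, for the peak of the combined demand.
merged_capacity = max(combined)
merged_util = utilization(combined, merged_capacity)

print(f"separate pools: capacity={separate_capacity}, utilization={separate_util:.0%}")
print(f"merged pool:    capacity={merged_capacity}, utilization={merged_util:.0%}")
```

With these made-up inputs the merged pool needs less capacity than the two separate pools and runs at a higher average utilization. The real gain depends on how far apart the peaks of the actual workloads are.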
- Product-based Solutions
Containerization, centralized scheduling, and hybrid deployment are all architecture optimization solutions. Moving offline processes online and online processes offline involves many problems. What the cloud needed was a product-based commercial solution that allowed users to purchase and use resources immediately. This would allow engineers and users to significantly improve their resource usage and labor efficiency without the need for complex architecture. Containerization, global resource scheduling, and hybrid deployment are the basic logic behind the evolution of internet companies. They can greatly improve capability and resource efficiency, and can then evolve into products. For example, Alibaba Cloud products have good performance and stability because the wide range of complex business scenarios they have had to handle, such as Double 11, has refined them over time.
- Global Cloud Migration Reduces e-Commerce Costs
First, we need to understand the similarities and differences between cloud-oriented R&D models and internet-oriented R&D models. Both types of models attempt to make cloud computing resource management and system architecture development as simple as service management and development. The difference between the models is in their resource pool sizes. Architecture optimization solutions, such as containerization, centralized scheduling, and hybrid deployment, reduce e-commerce costs.
Using product-based solutions for cloud computing also reduces e-commerce costs. The following figure shows a comparison between simulated hybrid deployment and cloud computing in reducing e-commerce costs. The orange blocks represent cloud-based solutions, while the blue blocks represent hybrid deployment solutions. Every business has its ups and downs. This comparison shows that non-cloud hybrid deployment solutions use their own resources rather than cloud resources. As a result, except during service peaks, such solutions have a low resource utilization and cannot handle higher-than-expected peaks. For example, if a hybrid deployment solution does not have the resources to deal with the Double 11 peak, we would have to prepare or purchase additional cloud resources. In contrast, cloud computing solutions elastically release surplus resources so that they constantly maintain a high resource utilization, whether during normal days or the Double 11 traffic peak.
In scenarios such as hybrid deployment and global scheduling, the annual cost of the architecture solution is 1,475 units. In addition, this solution is technically complex, involves service degradation, and requires continuous attention from R&D personnel. By using relatively simple technology, the cloud-based solution costs only 848 units and requires a lower investment in human resources.
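Taking the two annual figures at face value, the cloud-based solution is (1,475 - 848) / 1,475, or roughly 42.5%, cheaper. The sketch below reproduces that arithmetic and adds a toy model of why releasing surplus capacity can pay off even when elastic capacity has a higher unit price. The monthly demand series and the 1.5x price premium are assumptions made for illustration; they are not taken from the talk.

```python
def peak_provisioned_cost(demand, unit_price):
    """Cost when capacity is fixed at the highest demand in every period."""
    return max(demand) * len(demand) * unit_price


def elastic_cost(demand, unit_price):
    """Cost when capacity follows demand and surplus is released each period."""
    return sum(demand) * unit_price


def relative_saving(baseline, alternative):
    """Fraction by which `alternative` is cheaper than `baseline`."""
    return (baseline - alternative) / baseline


# Annual costs quoted in the text, in the article's unspecified units.
article_saving = relative_saving(1475, 848)
print(f"cloud-based vs. hybrid deployment: {article_saving:.1%} lower")

# Hypothetical monthly demand in servers, with one promotional peak in month 11.
monthly_demand = [10, 10, 12, 10, 11, 14, 10, 10, 12, 11, 30, 12]
owned = peak_provisioned_cost(monthly_demand, unit_price=1.0)
# Assumption: elastic capacity costs 1.5x as much per server-month as owned capacity.
elastic = elastic_cost(monthly_demand, unit_price=1.5)
print(f"toy model: owned={owned}, elastic={elastic}, "
      f"saving={relative_saving(owned, elastic):.1%}")
```

The toy model ignores the engineering effort that the text also counts against the hybrid deployment solution, so it illustrates only the resource side of the comparison.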
- Cloud Migration Benefits Database Application
Database budgeting is a complicated matter. When traffic or concurrency is high, you need to shard databases and tables, which ultimately means hashing an ID to choose a shard. Generally, a database budget is drawn up every three years to reserve resources, resulting in a serious waste of database resources. The main difficulty is that storage and computing are not separated, so databases hold a large amount of computing resources even when they do little computing. PolarDB-X is a high-performance product developed by Alibaba Cloud. The evolution from MySQL database sharding to the separation of storage and computing in PolarDB-X is a typical process of database cloud migration.
PolarDB-X adopts distributed storage. This architecture enables the elastic release of excess computing resources and the elastic reclamation of resources. Alibaba Cloud offers three basic products for different requirements: X-Dragon for computing, Apsara Distributed File System for storage, and Apsara Luoshen for the network. Apsara Distributed File System provided early support for various cloud and Alibaba Group businesses. It separates computing from storage and allows storage resources to be requested on demand, which offered many benefits to the businesses. After Apsara Distributed File System moved to the cloud, we turned it into a product. Now, customers directly use PolarDB and PolarDB-X. Alibaba Cloud databases provide clear advantages in performance and O&M, which means customers can use them for R&D and other purposes to significantly reduce costs.
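The database and table sharding step mentioned above can be sketched as a routing function. This is a minimal illustration of hash-based sharding in general, not the routing logic of MySQL middleware or of PolarDB: the shard counts and the `orders_db_N.orders_NN` naming scheme are hypothetical.

```python
import zlib
from typing import Tuple

NUM_DATABASES = 4         # hypothetical number of physical databases
TABLES_PER_DATABASE = 8   # hypothetical number of tables in each database


def route(record_id: str) -> Tuple[int, int]:
    """Map an ID to (database index, table index) using a stable hash."""
    # zlib.crc32 returns the same value in every process, unlike the built-in
    # hash() of a str, which is randomized per interpreter run by default.
    slot = zlib.crc32(record_id.encode("utf-8")) % (NUM_DATABASES * TABLES_PER_DATABASE)
    return divmod(slot, TABLES_PER_DATABASE)


def table_name(record_id: str) -> str:
    """Return the fully qualified (hypothetical) table that stores this ID."""
    database, table = route(record_id)
    return f"orders_db_{database}.orders_{table:02d}"


for record_id in ("user-1001", "user-1002", "user-1003"):
    print(record_id, "->", table_name(record_id))
```

With modulo hashing like this, changing the number of shards remaps most IDs, so the shard count has to be chosen well ahead of need. That is one reason capacity ends up being reserved years in advance under this model.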
- Large-scale Cloud Computing Offers a New Feature — Preemptible Instances
The service level agreement (SLA) guarantees for preemptible instances are not high, but the instance prices are very low. Preemptible instances are not an original invention of Alibaba Cloud. They were already a mature mechanism on AWS and are widely used in many companies. The following figure shows a mobile advertising and data analysis company that uses preemptible instances to significantly reduce costs. Servers that have been split into instances are left with some fragment resources and inventory resources, which can be purchased as preemptible instances. The SLA states that preemptible instances are not guaranteed: if another customer requires a complete resource, a preemptible instance may be released because it occupies fragments of that resource. However, the customer is notified five minutes before the instance is reclaimed. Of course, a customer with strong R&D capabilities can build their own platform that schedules work across a large number of fragment resources, minimizing costs while still meeting their own SLA. For companies with strong R&D capabilities, such as Alibaba Cloud, preemptible instances make it possible to reduce the number of normal computing instances by 80% while retaining the same level of computing power.
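One way a workload can tolerate reclamation is to save progress after every unit of work and stop cleanly once the notice arrives. The sketch below simulates that pattern in plain Python. It does not call any cloud API: the `notice_after` parameter is a stand-in for however a real platform delivers the reclaim notice, and squaring a number stands in for real work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checkpoint:
    """Progress that survives the loss of an instance (kept in memory here)."""
    next_task: int = 0
    results: List[int] = field(default_factory=list)


def run_on_preemptible(tasks: List[int], checkpoint: Checkpoint,
                       notice_after: Optional[int]) -> bool:
    """Process tasks from the checkpoint onward; return True when all are done.

    `notice_after` simulates the reclaim notice: the number of tasks this
    instance completes before being told it will be reclaimed (None = never).
    """
    done_here = 0
    while checkpoint.next_task < len(tasks):
        if notice_after is not None and done_here >= notice_after:
            return False  # notice received; progress is already checkpointed
        task = tasks[checkpoint.next_task]
        checkpoint.results.append(task * task)  # stand-in for real work
        checkpoint.next_task += 1               # record progress after each task
        done_here += 1
    return True


tasks = list(range(10))
state = Checkpoint()
finished = run_on_preemptible(tasks, state, notice_after=4)        # instance reclaimed
if not finished:
    finished = run_on_preemptible(tasks, state, notice_after=None)  # resume elsewhere
print(finished, state.results)
```

Because each task is recorded as soon as it completes, the replacement instance repeats no work. A real system would write the checkpoint to durable storage rather than keep it in memory.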
- Efficient Resource Usage Examples
As shown in the following figure, some online travel websites adopt elastic billing to take advantage of the significant variations in business traffic. Some media companies perform elastic operations to take advantage of the varying popularity of content. Some rendering companies use preemptible instances and economic instances to reduce costs and improve rendering efficiency. The following figure shows past data, but the current data is probably even better.
Cloud-oriented R&D is different from the past. In the past, the resource pool could be regarded as a small pond, but a cloud-oriented resource pool is like an ocean. Although cloud resources may still be wasted, cloud-oriented resource pools offer overall greater advantages.
The following figure shows that the resource usage efficiency of an O2O enterprise improved significantly after cloud migration. The company’s business peaks occur during lunch and dinner hours, and buying servers outright just to cover those peaks is expensive. The company solved its daily peak problem with cloud solutions such as global centralized resource scheduling, containerization, and elastic instance scaling in seconds, which reduced costs and improved service capabilities.
3) Evolution of Stability
Stability is an aspect of performance that every company pays close attention to. Stability is highly related to product quality and customer satisfaction, making it very important. The evolution of stability can be roughly divided into three stages.
The Early IT Age
- Stability Provided by Commercial Software and Hardware Vendors: The user architecture is simple, but the costs are high. Mainframes and minicomputers were used to ensure stability. In the past, many stability-related operations were performed on hardware by using the then-existing architecture methods, resulting in relatively high costs. For example, if one link fails, the system switches to the other by using a redundancy method, such as adding two network interface cards (NICs) or extra memory.
The Internet Age
- Architecture Optimization for Improving Stability: This approach involved complex but low-cost technology. Cheap PC servers are not as stable as minicomputers. Due to issues with the memory, motherboard, CPU, kernel, and other components, PC servers had a 0.003% downtime rate, but they offered the same computing density at a much lower price. Therefore, the stability needed to be improved through architecture optimization. If one server IP address is unreachable, that server is removed from the cluster. If the entire cluster has a problem, the business is switched to another cluster. For example, Alibaba e-commerce businesses may switch from an unavailable server to a new server through a flexible and highly available architecture that provides powerful scheduling, fault tolerance, and disaster recovery capabilities.
- High-availability Architecture — Capacity: For the Women’s Day event, we had to determine how many transactions we needed to support. For each application, it was necessary to carry out capacity planning, determine the required computing resources, and perform full-link stress testing.
- High-availability Architecture — Application Governance: We needed to understand the dependencies between applications. Based on those dependencies, we could perform link tracking, fault drills, and monitoring, and determine whether the cache was unavailable.
- High-availability Architecture — Operation Control: The system may not be able to meet all capacity requirements and might require throttling and degradation. If a customer requires the capability to handle 100 transactions, the business team may consider this value too high and only provide the capability for 80 transactions. In this case, up to 20% of transactions will fail during peak periods, while 80% of the requested capacity is enough at other times. An appropriate solution is to throttle for a few minutes during peak hours. This has little effect on transactions during normal times and allows the customer to reduce reserved resources by 20%. In addition, such a solution can provide switch plans and monitoring alerts.
- High-availability Architecture — Fault Tolerance and Disaster Recovery: This includes the local active-active mode, active geo-redundancy, and traffic scheduling.
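The throttling described under Operation Control can be sketched with a token bucket, one common rate-limiting technique. The source does not say which algorithm Alibaba uses, so this choice is an assumption. The example reuses the numbers from the text: capacity for 80 transactions against a peak demand of 100. Time is passed in explicitly so that the behavior is deterministic.

```python
class TokenBucket:
    """Admit about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start full
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # throttled: the caller degrades or asks the client to retry


bucket = TokenBucket(rate=80, capacity=80)

# Peak: 100 requests arrive in the same instant, but only 80 tokens exist.
peak_admitted = sum(bucket.allow(now=0.0) for _ in range(100))

# One second later the bucket has refilled; 50 requests is well within capacity.
normal_admitted = sum(bucket.allow(now=1.0) for _ in range(50))

print(peak_admitted, normal_admitted)
```

Rejected requests fail fast instead of overloading the system, which is what lets the reserved capacity stay at 80 rather than 100.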
Product-based Cloud Computing
- Cloud Computing Provides High Single-point Stability Through a Combination of Software and Hardware: The user architecture is simple and the cost is low. Regardless of cost considerations, the ideal cloud model is to sell minicomputers while providing the products and capabilities of internet architecture. This way, individual hosts are stable while the architecture is powerful. Some cloud customers are large while the others are small and their technical capabilities vary. However, the requirements for cloud products and servers are equally strict. Alibaba Cloud currently provides the highest SLA guarantees in the world. To accomplish this, we have invested heavily in resources. Alibaba Cloud hopes that the stability and availability of individual computers meet extremely demanding standards, providing customers with higher stability than physical machines. The lifecycle of ECS instances is around three years, which is less than that of minicomputers. Therefore, it is not cost-effective to implement high availability for cloud hardware and servers. In addition, the cloud uses inexpensive hardware to achieve high stability. Therefore, raising the cost of physical resources is not in line with the cloud trend. Thus, objectively speaking, cloud computing is not as stable as minicomputers.
Cloud computing provides single-point stability and distributed full-stack products and solutions through market-based mechanisms. The following figure shows how to improve the stability of cloud computing. Consider an example in which hardware system monitoring, together with AI-enabled fault prediction, predicts metrics such as disk and motherboard failure times, failure rates, and fault causes, providing early warnings and enabling the corresponding migration. Migration can be performed in the cloud. If failures can be predicted, the relevant compute instances can be promptly migrated, eliminating failures without the customer noticing and reducing the probability of downtime by 80%. In addition, the high-availability network-wide deployment capability can eliminate the threat that hardware faults and single points of failure (SPOFs) pose to businesses. For example, you can distribute resources across different batches and different locations. These capabilities assume a certain business scale, so the larger the business scale, the more effective these methods are and the higher the stability they provide.
- Full-stack Stability Solutions and Products: Alibaba Cloud has developed high-availability products through internet-based capabilities and Alibaba capabilities in the fields of finance, e-commerce, and logistics. Performance Testing Service (PTS) is a comprehensive stress testing tool that initiates end-to-end stress testing for large-scale online services and simulates highly realistic online traffic for internet-based architectures. PTS helps to determine the queries per second (QPS) and transactions per second (TPS) of the target resources and balance system resources.
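Once stress testing has produced a per-instance QPS figure, capacity planning for an event reduces largely to arithmetic. The sketch below is a hedged illustration of that step and is not part of PTS: the application names, fan-out factors, measured QPS values, and the 25% headroom are all hypothetical.

```python
import math


def instances_needed(target_qps: float, qps_per_instance: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve `target_qps` while keeping `headroom` spare."""
    if qps_per_instance <= 0 or not 0 <= headroom < 1:
        raise ValueError("qps_per_instance must be positive and headroom in [0, 1)")
    usable_qps = qps_per_instance * (1 - headroom)
    return math.ceil(target_qps / usable_qps)


ENTRY_QPS = 60_000  # hypothetical peak rate of user requests at the entry point

# Hypothetical per-application data:
# (calls made to the application per user request, QPS measured per instance).
APPLICATIONS = {
    "trade": (1.0, 400),
    "inventory": (2.0, 800),
    "user": (0.5, 1000),
}

plan = {
    name: instances_needed(ENTRY_QPS * fan_out, measured_qps)
    for name, (fan_out, measured_qps) in APPLICATIONS.items()
}
print(plan)
```

The fan-out factors are why the dependencies between applications must be known first: an application called twice per user request needs capacity for twice the entry traffic.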
4) Discussion and Conclusion
The Best Carrier of Containers
Physical machines or virtual machines? Containers are the current trend. Many companies provide container services based on physical machines. Objectively, it is not recommended to virtualize a container multiple times through virtual machines. Therefore, container services based on physical machines are preferable. However, physical machine-based container services waste the advantages of the cloud.
The cloud has five main advantages. First, it has strong elasticity, which greatly reduces costs and improves availability. Second, it enables highly efficient supply chains. Off the cloud, it is not possible to create a compute instance with a single click. Instead, many complex processes have to be carried out. Third, it provides high financial efficiency. Off the cloud, purchased assets are expected to last three or four years, whereas customers of cloud servers do not have to bear depreciation or hardware breakage. Fourth, the cloud is stable. Cloud services are provided by professional companies and are therefore more stable than the customer’s own services. Fifth, the cloud provides a rich selection of products, with new products constantly emerging. This includes both Alibaba Cloud products and our partners’ products. Customers can also convert their own capabilities into products and services that can be provided to other customers in the cloud. Therefore, the cloud provides a richer ecosystem.
- X-Dragon Subsystem: X-Dragon servers use third-generation virtualization technology designed for container services. VMware is a typical example of first-generation virtualization technology, and Xen and KVM are second-generation technologies. Third-generation virtualization technology integrates software and hardware to address the defects of the previous two generations. Since virtualization is implemented through hardware, the virtualization overhead is almost zero, greatly reducing costs. If you need stronger control capabilities and have more funds and manpower to invest in operation and control, use the X-Dragon subsystem for container service. Of course, cloud-based virtual components or other servers can also be used.
Manage Cloud Applications and Mobile Apps
Alibaba seeks to use standardization to simplify the management of cloud applications. Otherwise, the services, software, and application systems provided by different companies will have different management methods. The resulting large number of R&D, testing, and production environments and private cloud, public cloud, and Internet-of-Things (IoT) environments will make management difficult. However, any mobile user can easily manage their mobile apps, due to the standardization of the operating system, such as Android or Huawei EMUI.
Open Application Model
In cooperation with Microsoft Azure, Alibaba Cloud launched the Open Application Model (OAM), which is the first standard definition and architecture model for cloud-native applications. OAM defines the components, dependencies, and architecture of an application at the developer layer, defines the O&M configuration and runtime parameters of the application at the maintenance layer, and describes the application at the platform layer. An application defined with OAM can be installed in a single click and run in multiple locations. OAM exposes the basic capabilities and features of the platform in a standardized way. As long as they meet the OAM standard, K8s environments built by Alibaba Cloud, other clouds, and other individuals can be managed more easily.
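The separation of concerns described above can be illustrated with a small data model. The sketch below borrows the OAM vocabulary of components, traits, and application configurations, but it is not the OAM schema and does not use any OAM tooling. The class fields, platform names, and trait kinds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    """Developer concern: what the application is and what it depends on."""
    name: str
    image: str
    depends_on: List[str] = field(default_factory=list)


@dataclass
class Trait:
    """Operator concern: how the component is run (scaling, routing, and so on)."""
    kind: str
    properties: Dict[str, int]


@dataclass
class ApplicationConfiguration:
    """Binds one component to the operational traits applied to it."""
    component: Component
    traits: List[Trait]


# Hypothetical platforms and the trait kinds each one advertises.
PLATFORM_TRAITS = {
    "public-cloud-cluster": {"autoscaler", "ingress"},
    "edge-cluster": {"ingress"},
}


def deploy_plan(config: ApplicationConfiguration, platform: str) -> Dict[str, object]:
    """Check a definition against a platform's capabilities and describe the rollout."""
    supported = PLATFORM_TRAITS[platform]
    missing = sorted(t.kind for t in config.traits if t.kind not in supported)
    if missing:
        raise ValueError(f"{platform} does not support traits: {missing}")
    return {
        "platform": platform,
        "component": config.component.name,
        "traits": {t.kind: t.properties for t in config.traits},
    }


web = Component(name="storefront", image="example/storefront:1.0", depends_on=["orders"])
config = ApplicationConfiguration(web, [Trait("ingress", {"port": 8080})])

# The same definition is planned, unchanged, for every platform that supports it.
plans = [deploy_plan(config, platform) for platform in PLATFORM_TRAITS]
print(plans)
```

Here the developer writes only the `Component`, the operator writes only the `Trait` list, and each platform declares what it supports, so one definition can be checked against and deployed to several environments.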
- Cloud Computing Innovation: Using cloud-based architectures, developers continue to create innovative R&D and O&M models. We must focus on the use of cloud resources, the resources provided by cloud vendors, and the tools that allow us to flexibly use resources. We must make full use of the advantages of cloud computing to ensure application stability. With OAM, cloud application management is as simple as mobile app management.
- Cloud Computing Challenges: First, we must ensure that the stability of cloud computing surpasses that of minicomputers while keeping prices low. Second, IDC construction must be more environmentally friendly and energy-efficient. Third, we must ensure steady progress in trusted technology innovation. Fourth, we must transition from cloud-based operation to cloud-based development.
- Clouds of the Future: With the development of the internet, computers have increasingly affected all aspects of human life and created significant value. In the same way, cloud resources and features will give rise to new R&D models. As the cloud continues to develop rapidly, it will bring yet more profound changes. Today, the focus of the cloud is on resources, costs, O&M, elasticity, efficiency, and stability. In the future, the cloud will have a more profound effect on applications and business models and create more value for our society.