Deploying Tens of Thousands of Servers in Minutes with Container Technology
By Alibaba Cloud Enterprise Application Team
Large-scale promotions such as the Double 11 Shopping Festival usually require peak-hour traffic to be estimated ahead of time. However, the pre-calculated resources and application capacity might not be sufficient to absorb the actual traffic peaks, so last-minute scaling needs to be available. Container technology works well for this scenario because it supports rapid, automatic, and elastic scaling as needed.
During this year’s Double 11 Global Shopping Festival, Alibaba Cloud’s container image warehouse stored 300,000 different images, totaling 10 million copies of images, which were downloaded up to 800 million times.
How can we support so many customers with simultaneous business peaks requiring a large number of server resources?
Billions of Image Pulls Powered by Container Technology
As an agile, portable, and controllable lightweight virtualization technology, container technology has been popular among developers since its introduction. More importantly, container technology establishes a standardized delivery method: container images.
As a package that encapsulates both the application code and its environment dependencies, the container image is an environment-independent deliverable that can be used at any stage of the software lifecycle. Just as physical shipping containers revolutionized the logistics industry, the container technology that introduced container images has revolutionized traditional software delivery models.
The entire container technology industry has experienced explosive worldwide growth over the past three years. According to statistics, 67% of enterprises have adopted or plan to adopt Docker in production to help them implement agile development and improve their R&D delivery efficiency.
According to DockerCon 2017 statistics, there are about one million Docker apps, a 30-fold increase over the past three years. There have been more than 11 billion container image pulls, an almost exponential increase over the same period.
Double 11 Shopping Festival and Container Registry
Alibaba began to implement container technology as early as 2015. Alibaba containerized all core transaction apps for the Double 11 Global Shopping Festival in 2016, handling up to 175,000 orders per second with tens of thousands of containers. Before last year’s Double 11 Global Shopping Festival, Alibaba achieved full containerization of online services within the group, which required an internal deployment of over one million containers. This helped handle the peak transaction volume of 325,000 orders per second and enabled rapid deployment of tens of thousands of servers within ten minutes.
Currently, Container Registry has hosted 100,000 images group-wide, with over 200 million total downloads.
Multi-Faceted Container Optimization
To handle this pressure, we have optimized our container service from multiple angles. When we began implementing containers, with tens of thousands of 10k+ user apps being released daily, we experienced an increased release failure rate, which affected our normal iterative business process. The root cause was the vast number of app image pull requests during the release process, combined with redundant and bloated app image content. At that time, we had many images larger than 5 GB, and the file server could not handle that many requests at that size.
To solve the container image acquisition problem in large-scale, high-concurrency situations, we optimized our container service in three areas: container image size, content acceleration and traffic control, and image registry performance.
Container Image Size Optimization
Previously, we wrote the entire app build process in a single Dockerfile, including the compilation, testing, and packaging of the app and any dependent libraries. This resulted in bloated images with many layers and potential source code disclosure. We adopted multi-stage image builds, which separate intermediate products from the final image, to produce much more compact app images.
We merged commands with similar functions into the same layer and built frequently used apps or environments into reusable base images, minimizing the number of image layers required. We have also held many offline workshops to share our container image optimization best practices.
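To make the layer aggregation idea concrete, here is a minimal Python sketch that merges consecutive RUN instructions of a Dockerfile into a single RUN, so that they produce one layer instead of several. It is only an illustration, not the tooling used at Alibaba: the function name `merge_run_instructions` is hypothetical, and the sketch assumes single-line, shell-form RUN instructions without line continuations.

```python
def merge_run_instructions(dockerfile_lines):
    """Collapse consecutive single-line RUN instructions into one RUN joined by '&&'.

    Assumption: shell-form RUN lines only, no backslash continuations.
    """
    merged = []
    pending = []  # commands from consecutive RUN lines, waiting to be merged

    def flush():
        # Emit one RUN instruction for all pending commands, if any.
        if pending:
            merged.append("RUN " + " && ".join(pending))
            pending.clear()

    for line in dockerfile_lines:
        stripped = line.strip()
        if stripped.upper().startswith("RUN "):
            pending.append(stripped[4:].strip())
        else:
            flush()
            merged.append(line)
    flush()
    return merged


if __name__ == "__main__":
    # "base-image" and the paths below are placeholder names.
    original = [
        "FROM base-image",
        "RUN apt-get update",
        "RUN apt-get install -y curl",
        "COPY app /app",
        "RUN chmod +x /app/start.sh",
    ]
    for line in merge_run_instructions(original):
        print(line)
```

In this example, the two package-related RUN instructions end up in one layer, while the RUN after COPY stays separate because it depends on the copied files.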
Container Content Acceleration and Traffic Control
In large-scale image distribution scenarios, image optimization alone has limited effect. We also needed to improve the image pull performance of the file distribution system. Our first thought was to add servers, but that just shifted the bottleneck to the backend storage. In addition, large numbers of client requests from different IDCs consume significant network bandwidth and cause network congestion. At the same time, many businesses were expanding internationally and many applications were being deployed overseas. Downloads from overseas servers had to go back to the domestic origin, which wasted significant international bandwidth and was very slow. If a transmission fails and a large file has to be downloaded again, or if the network quality is poor, the whole process becomes very inefficient.
To solve these problems, Alibaba Cloud launched Dragonfly. Dragonfly combines P2P technology with smart compression, smart traffic control, and a range of other techniques to resolve large-scale file downloading and file distribution issues in cross-network, isolated scenarios. This significantly improves data push, large-scale container image distribution, and other business capabilities.
Images are downloaded to the local machine layer by layer. The Docker pull process works as follows:
The Docker daemon calls the registry API to fetch the image manifest. From the manifest, it determines the URL of each layer. The daemon then downloads the image layers from the registry in parallel to the host’s local image store.
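The pull sequence described above can be sketched in a few lines of Python. This is a simplified illustration rather than the Docker daemon's actual code. The URL paths follow the Docker Registry HTTP API V2 layout (`/v2/<name>/manifests/<reference>` and `/v2/<name>/blobs/<digest>`), while `fetch_json` and `fetch_blob` are hypothetical callables supplied by the caller, and authentication and digest verification are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def manifest_url(registry, repository, reference):
    # Registry API V2 path for an image manifest (reference is a tag or digest).
    return f"https://{registry}/v2/{repository}/manifests/{reference}"


def layer_urls(registry, repository, manifest):
    # Each layer entry in the manifest carries a digest that addresses one blob.
    return [
        f"https://{registry}/v2/{repository}/blobs/{layer['digest']}"
        for layer in manifest["layers"]
    ]


def pull_image(registry, repository, reference, fetch_json, fetch_blob, max_workers=3):
    """Fetch the manifest, then download all layers in parallel.

    fetch_json(url) and fetch_blob(url) are hypothetical helpers injected by
    the caller; a real client would also handle auth tokens and checksums.
    """
    manifest = fetch_json(manifest_url(registry, repository, reference))
    urls = layer_urls(registry, repository, manifest)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        blobs = list(pool.map(fetch_blob, urls))  # map() preserves input order
    return dict(zip(urls, blobs))
```

Passing the fetch functions in as parameters keeps the sketch runnable without a network and makes it clear where a P2P downloader could be substituted for the direct registry download.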
Ultimately, the problem of image transmission becomes a problem of downloading each image layer file concurrently. This is precisely what Dragonfly is good at: using P2P to transfer each layer file to the local image store.
So how exactly does this work?
Dragonfly activates a proxy on the host. All of the Docker engine’s commands and requests go through this proxy. Let’s look at the diagram below:
First, the docker pull command is intercepted by the dfget proxy. The dfget proxy then sends a scheduling request to the cluster manager (CM). When the CM receives the request, it checks whether the requested file is already cached locally. If not, it downloads the file from the registry and generates seed block data, which can be used as soon as it is generated. If the file is already cached, block tasks are generated immediately. The requester parses its block tasks and downloads the block data from other peers or supernodes. A layer is complete once all of its blocks have been downloaded, and the whole image is complete once all of its layers have been downloaded.
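The flow above can be modeled with a small in-memory simulation. The sketch below is not Dragonfly's implementation: the class names `SuperNode` and `Peer`, the tiny block size, and the dictionary that stands in for the registry are all assumptions made for illustration. It shows only the core idea: the supernode (playing the role of the CM) seeds a layer from the registry once, and later requesters take blocks from peers that already hold them.

```python
BLOCK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


class SuperNode:
    """Stands in for the CM: caches layers as blocks and hands out block tasks."""

    def __init__(self, registry):
        self.registry = registry   # dict: layer id -> bytes (stand-in for the registry)
        self.cache = {}            # layer id -> list of seed blocks
        self.registry_fetches = 0  # how often we had to go back to the registry

    def dispatch(self, layer_id):
        # On a cache miss, fetch the layer once and split it into seed blocks.
        if layer_id not in self.cache:
            data = self.registry[layer_id]
            self.registry_fetches += 1
            self.cache[layer_id] = [
                data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)
            ]
        # A "block task" here is simply the index of a block to download.
        return list(range(len(self.cache[layer_id])))

    def get_block(self, layer_id, index):
        return self.cache[layer_id][index]


class Peer:
    def __init__(self, name):
        self.name = name
        self.blocks = {}    # (layer id, block index) -> bytes
        self.peer_hits = 0  # blocks obtained from other peers

    def download_layer(self, layer_id, supernode, other_peers):
        indexes = supernode.dispatch(layer_id)
        for index in indexes:
            key = (layer_id, index)
            # Prefer a peer that already has the block; fall back to the supernode.
            source = next((p for p in other_peers if key in p.blocks), None)
            if source is not None:
                self.blocks[key] = source.blocks[key]
                self.peer_hits += 1
            else:
                self.blocks[key] = supernode.get_block(layer_id, index)
        # The layer is complete once every block has been downloaded.
        return b"".join(self.blocks[(layer_id, i)] for i in indexes)
```

With two peers pulling the same layer one after the other, the registry is contacted only once, and the second peer receives every block from the first.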
Dragonfly supports multiple container technologies. No alteration has to be made to the container itself. Image distribution can be accelerated to as much as 57 times faster than using native mode, and outbound traffic on the Registry network is reduced by over 99.5%. It provides strong support for Alibaba’s rapid business expansion and large-scale promotion of the Double 11 Global Shopping Festival.
Image Registry Performance Optimization
Dragonfly uses P2P to get the image layers from the registry and transfer them to the local disk. However, when there are tens of thousands of concurrent image pulls, getting the image’s manifest file from the registry becomes a performance bottleneck. Container Registry includes many custom enhancements in terms of code and infrastructure:
In terms of code: Container Registry builds on Docker Registry with optimizations that let it automatically identify hotspot data based on previous image requests. Hotspot data is cached so that Container Registry can easily handle large-scale concurrent image manifest pulls. We also added dynamic download source selection, which automatically returns the registry address nearest to the origin of each image download request.
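As a rough illustration of these two ideas, the following Python sketch caches a manifest once it has been requested often enough and picks a registry address by client region. It is a sketch under stated assumptions, not Container Registry's code: the class `HotspotManifestCache`, the function `nearest_registry`, the simple request-count threshold, and the region-to-address mapping are all hypothetical.

```python
from collections import Counter


class HotspotManifestCache:
    """Cache manifests for images whose request count reaches a threshold."""

    def __init__(self, load_manifest, hot_threshold=3):
        self.load_manifest = load_manifest  # hypothetical backend loader: image -> manifest
        self.hot_threshold = hot_threshold
        self.request_counts = Counter()     # request history per image
        self.cache = {}                     # hot image -> cached manifest

    def get(self, image):
        self.request_counts[image] += 1
        if image in self.cache:
            return self.cache[image]        # served without touching the backend
        manifest = self.load_manifest(image)
        # Promote the image to the cache once it qualifies as a hotspot.
        if self.request_counts[image] >= self.hot_threshold:
            self.cache[image] = manifest
        return manifest


def nearest_registry(client_region, registries, default):
    """Return the registry address mapped to the client's region, or a default.

    Assumption: "nearest" is approximated by a static region lookup.
    """
    return registries.get(client_region, default)
```

A production system would also need cache expiry and invalidation when a tag is pushed again, which this sketch leaves out.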
In terms of infrastructure: In order to cope with spikes in traffic, the Container Registry has enhanced multi-dimensional registry traffic and storage monitoring. It performs heartbeat detection on the network, and collects the monitoring data in real time. When the monitor detects that a threshold has been reached, the Registry automatically sends an alert and performs elastic infrastructure scaling.
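The monitor-then-scale loop can be sketched as a pure function over the latest metric samples. The metric names, the threshold values, and the "add one replica per breach" policy below are assumptions chosen for illustration. The source does not describe the actual scaling policy.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # hypothetical metric name, e.g. "pulls_per_second"
    limit: float  # value at which the metric counts as breached


def evaluate(samples, thresholds):
    """Return the metrics whose latest sample has reached its threshold."""
    return [t.metric for t in thresholds if samples.get(t.metric, 0.0) >= t.limit]


def react(samples, thresholds, current_replicas, max_replicas):
    """Produce alert messages and a target replica count for one monitoring tick."""
    breached = evaluate(samples, thresholds)
    alerts = [f"ALERT: {metric} reached {samples[metric]}" for metric in breached]
    # Assumed policy: scale out by one replica when any threshold is breached.
    target = min(current_replicas + 1, max_replicas) if breached else current_replicas
    return alerts, target
```

For example, with a limit of 1000 on `pulls_per_second`, a sample of 1500 yields one alert and raises the target from 2 replicas to 3.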
Benefits of Container Registry
Alibaba’s container technologies were made available to cloud users with the official launch of Container Registry (public beta) in October 2017. Container Registry allows you to manage images throughout the image lifecycle. It provides secure image management, stable image build creation across global regions, and easy image permission management. This service simplifies the creation and maintenance of the image registry and supports image management across regions. Combined with other cloud services such as Container Service and CodePipeline, Container Registry provides a one-stop solution for using Docker in the cloud.
Multi-Region Support and Image Accelerator
Alibaba Cloud Container Registry provides multi-region support around the globe. Users can host container images close to their own business locations to optimize image upload and download speed. We also provide every user with a dedicated image accelerator, which allows you to quickly obtain images from anywhere in the world. The image accelerator includes internally developed intelligent routing and dynamic caching technologies, which greatly improve the image download speed and user experience. It is also fully compatible with Docker native parameter configuration and supports Linux, macOS, and Windows operating systems. So far, Container Registry has accelerated tens of thousands of image pulls, saving users 192,000 hours.
Alibaba Cloud Container Registry provides extensive and stable image building functions, including automatic build, overseas build, and multi-stage build, giving DevOps teams convenient tools to implement container best practices on the cloud. You can host your app code on Alibaba Cloud Code, GitHub, Bitbucket, or your own GitLab, and compile and test the code using Container Registry’s multi-stage build function or Alibaba Cloud CodePipeline. Once the image is built, it is pushed to Container Registry, where it is hosted. Finally, Container Registry’s webhook triggers redeployment of the corresponding app on the Container Service cluster.
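One plausible reading of "Docker native parameter configuration" is the `registry-mirrors` setting in the Docker daemon configuration file (`daemon.json`). This is an assumption, since the source does not name the setting. Under that assumption, the Python sketch below adds an accelerator address to an existing daemon configuration. The address is a placeholder, and the helper `with_registry_mirror` is hypothetical.

```python
import json


def with_registry_mirror(daemon_config, mirror_url):
    """Return a copy of a daemon.json dict with mirror_url added to registry-mirrors."""
    updated = dict(daemon_config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror_url not in mirrors:  # keep the operation idempotent
        mirrors.append(mirror_url)
    updated["registry-mirrors"] = mirrors
    return updated


if __name__ == "__main__":
    # Placeholder address; use the accelerator address assigned to your account.
    config = with_registry_mirror({}, "https://mirror.example.com")
    print(json.dumps(config, indent=2))
```

The resulting JSON would be written to the daemon configuration file, after which the Docker daemon needs a restart to pick it up.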
The entire solution enables automatic code testing after a user submits the code, automatic image building after passing the test, and image deployment to the test, pre-release, or production environment cluster.
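The last step of this pipeline, a webhook that triggers redeployment after an image push, might look roughly like the Python sketch below. The payload fields (`repository`, `tag`), the `deployments` mapping, and the `redeploy` callback are hypothetical and do not reflect Container Registry's actual webhook format or the Container Service API.

```python
import json


def handle_push_webhook(body, deployments, redeploy):
    """Handle an image-push notification and redeploy the matching app.

    body:        JSON string with hypothetical "repository" and "tag" fields
    deployments: dict mapping repository name -> app name on the cluster
    redeploy:    callable(app, image) supplied by the caller (assumed interface)
    """
    event = json.loads(body)
    image = f"{event['repository']}:{event['tag']}"
    app = deployments.get(event["repository"])
    if app is None:
        return None  # no app is bound to this repository, so nothing to do
    redeploy(app, image)
    return app
```

Keeping the redeploy action as an injected callback means the same handler can target a test, pre-release, or production cluster.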
Image Security Scan
When images become a core asset of an enterprise and the core of the enterprise software delivery pipeline, online application security becomes crucial, too. Container Registry provides a convenient image security scan function that provides multi-dimensional vulnerability reports. An image vulnerability report contains the CVE identification number of the vulnerability, the vulnerability severity, the vulnerability location, and solutions provided by the official service provider and the community.
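To show how such a report could be consumed, here is a small Python sketch that models the four fields listed above and filters findings by severity. The field names, the three severity levels, and the CVE identifiers are placeholders made up for illustration. The real report format may differ.

```python
from dataclasses import dataclass

# Assumed severity scale, ordered from least to most severe.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Vulnerability:
    cve_id: str    # CVE identification number
    severity: str  # one of the keys in SEVERITY_ORDER
    location: str  # where in the image the vulnerable component was found
    fix: str       # suggested solution from the vendor or community


def at_or_above(report, minimum):
    """Return the findings whose severity is at least `minimum`."""
    floor = SEVERITY_ORDER[minimum]
    return [v for v in report if SEVERITY_ORDER[v.severity] >= floor]


if __name__ == "__main__":
    # Placeholder entries, not real CVEs.
    report = [
        Vulnerability("CVE-0000-0001", "high", "/usr/lib/libexample.so", "upgrade libexample"),
        Vulnerability("CVE-0000-0002", "low", "/usr/bin/tool", "upgrade tool"),
    ]
    for finding in at_or_above(report, "medium"):
        print(finding.cve_id, finding.severity, finding.location)
```

A filter like this could be used as a gate in the delivery pipeline, for example to block deployment while high-severity findings remain.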
Container Registry has been optimized for better registry performance. Dragonfly adopts P2P technology in combination with intelligent compression and traffic control solutions. Container Registry and Dragonfly work in conjunction to solve various large-scale file download and cross-network isolation file distribution challenges. They have become crucial components of Alibaba’s infrastructure, providing strong support for Alibaba’s rapid business expansion and large-scale promotion of the Double 11 Global Shopping Festival.
Container Registry, in public beta since October 2017, provides secure image management, stable image build creation across global regions, and easy image permission management free of charge. These features help you manage container images throughout the image lifecycle on the cloud and experience the agile revolution that container technologies bring.
In addition, the enterprise edition of Container Registry is about to launch. The enterprise edition will allow users to host images on their own OSS instances, provide a P2P image plug-in to support large-scale image pulls, and provide enterprise-level features such as custom domain name configuration, image instance synchronization, and image security scans.