Conversation with the First Chinese TOC Member about How Dragonfly Became a CNCF Incubation Project
On April 10, 2020, the Cloud Native Computing Foundation (CNCF) Technical Oversight Committee (TOC) voted to promote Dragonfly, an open-source project from China, to a CNCF incubation project. Dragonfly thus became the third Chinese project to enter the CNCF incubation stage, after Harbor and TiKV.
Founded in July 2015, CNCF is one of the important open-source organizations under the Linux Foundation. Focusing on microservices, DevOps, continuous delivery, and containerization, CNCF is committed to maintaining and integrating open-source cloud-native technologies to support the orchestration of containerized microservice applications.
Currently, CNCF has more than 300 member companies, including AWS, Azure, Google, Alibaba Cloud, and other mainstream global cloud computing vendors. The CNCF TOC is composed of 11 representatives with profound technical knowledge and extensive industry background. They provide technical leadership for the cloud-native community.
Today, the cloud has become public infrastructure, and cloud-native is considered the 2.0 standard for cloud computing technologies. CNCF therefore leads the development of cloud-native technologies and plays an important role in the industry. How did the Dragonfly project become a CNCF incubation project? And what role does Dragonfly play in the cloud-native technology ecosystem? To get insight into the Dragonfly project and the current development of cloud-native technologies in China, we interviewed Li Xiang, the first Chinese member of the CNCF TOC and a senior technical expert at Alibaba Cloud. He shared some background on CNCF and Dragonfly.
About CNCF and TOC
CNCF is one of the most influential open-source organizations, and Li Xiang is one of the 11 representatives on the TOC. According to Li Xiang, CNCF is essentially a project-centered foundation whose goal is to take in high-quality projects and then attract more end users through them. As more and more customers use CNCF projects, vendors integrate these open-source projects into products or cloud services that customers can use at lower cost and with higher efficiency. This helps the entire cloud-native ecosystem form a closed loop of healthy development.
CNCF hopes to connect the foundation, developers, and vendors through projects. Therefore, the core goal of the TOC is to collect the best and most suitable projects in line with the foundation’s cloud-native concepts. Hence, Li Xiang’s main job is to find the best projects, much like a talent scout.
Li Xiang also introduced the project promotion mechanism within CNCF. Each CNCF project needs to go through three stages: sandbox, incubation, and graduation.
The sandbox stage is the early stage of a project’s development. Each TOC member looks for promising projects, provides them with suggestions, and promotes them into the sandbox stage. Unlike the Linux Foundation, which has been developing for more than a decade, CNCF still needs to define some standards and processes. For example, what exactly does a sandbox-stage project mean? What is the process for entering the sandbox stage? How does a project move from the sandbox stage to the incubation stage, and by what standard? Defining these standards and processes is the responsibility of the TOC, so Li Xiang has spent a lot of time on it.
In addition, the TOC has other responsibilities: for example, how to allocate CNCF’s limited resources to ensure that the foundation stays project-centered, and how to keep CNCF innovative while absorbing a large number of projects without losing its concepts of advancement, neutrality, and cloud-native.
What Is Dragonfly?
Dragonfly mainly resolves the image distribution problems in Kubernetes-based distributed application orchestration systems. The Dragonfly architecture solves four major challenges: large-scale image downloading, long-distance transmission, bandwidth cost control, and secure transmission.
1) Large-scale Image Downloading
- PouchContainer: Alibaba Group’s open-source enterprise-level rich container engine technology, which is efficient and lightweight.
- Registry: The repository for container images. Each image is composed of multiple image layers, and each image layer is a common file.
- SuperNode: The Dragonfly server, which manages the lifecycle of seed blocks, constructs a peer-to-peer (P2P) network, and schedules clients to transmit specified blocks to each other.
- Block: When Dragonfly is used to download an image layer, the SuperNode splits the entire file into blocks. Each block on the SuperNode is called a seed block. Seed blocks are downloaded by a number of initial clients and rapidly propagated among all clients. The size of a seed block is calculated dynamically.
- DFget: The Dragonfly client installed on each host. It is responsible for uploading and downloading seed blocks and for command interactions with the container daemon.
- Peer: Hosts that download the same file are called peers.
The process of downloading an image is as follows:
1) PouchContainer sends a Pull Image command, which is received by the DFget agent.
2) DFget sends a scheduling request to the SuperNode.
3) After receiving the request, the SuperNode checks whether the corresponding file has been cached locally. If not, the SuperNode downloads the corresponding file from the Registry and generates seed blocks. (Once a seed block is generated, it can be propagated immediately, instead of waiting for the SuperNode to download the entire file.) If the file is already cached, the SuperNode generates a block-splitting task.
4) The client parses the task and downloads blocks from other peers or the SuperNode. Once all blocks of a layer are downloaded, that layer is complete and is passed to the container engine. When all layers are downloaded, the entire image is ready.
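The four steps above can be sketched in a few dozen lines. This is a hypothetical illustration only: every class and method name (SuperNode, DFGetClient, schedule, and so on) is invented for the example and does not come from the Dragonfly codebase.

```python
# Hypothetical sketch of the Dragonfly pull flow described above.
# All names are illustrative, not Dragonfly's real API.

class SuperNode:
    """Caches files as seed blocks and schedules peers."""

    def __init__(self):
        self.cache = {}  # file_id -> list of seed blocks

    def schedule(self, file_id, fetch_from_registry):
        # Step 3: download from the Registry only on a cache miss.
        if file_id not in self.cache:
            data = fetch_from_registry(file_id)
            self.cache[file_id] = self.split_into_blocks(data)
        # Cache hit (or freshly cached): return a block-splitting task.
        return list(range(len(self.cache[file_id])))

    def split_into_blocks(self, data, block_size=4):
        # The real system sizes blocks dynamically; fixed size here for brevity.
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class DFGetClient:
    """Steps 1-2 and 4: receives the pull, requests a task, downloads blocks."""

    def __init__(self, supernode):
        self.supernode = supernode

    def pull(self, file_id, fetch_from_registry):
        task = self.supernode.schedule(file_id, fetch_from_registry)
        # In the real system blocks come from peers or the SuperNode;
        # here we read them straight from the SuperNode's cache.
        blocks = [self.supernode.cache[file_id][i] for i in task]
        return b"".join(blocks)


supernode = SuperNode()
client = DFGetClient(supernode)
layer = client.pull("layer-sha256-abc", lambda _: b"image-layer-bytes")
print(layer == b"image-layer-bytes")  # True: layer reassembled from blocks
```

The key point the sketch preserves is that the Registry is contacted only on a cache miss; every later pull of the same layer is served from seed blocks.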
With this P2P technology, Dragonfly eliminates the bandwidth bottleneck of the image repository and makes full use of each peer’s hardware resources and network transmission capability: the larger the scale, the faster the transmission. It is worth mentioning that the Dragonfly architecture requires no changes to the container technology stack, so it seamlessly gives containers P2P image distribution capability and greatly improves file distribution efficiency.
2) Long-Distance Transmission
By using the Content Delivery Network (CDN) cache technology, Dragonfly enables each client to download seed blocks from the nearest SuperNode rather than download through a cross-region network. The CDN cache works as follows:
The first requester of a file triggers a cache check. If no cache exists, back-to-source synchronization is triggered to generate seed blocks. Otherwise, a HEAD request carrying the If-Modified-Since field is sent to the source station; the field’s value is the file’s last modification time as previously returned by the server. A 304 response code indicates that the source file has not been modified and the cache is valid. The SuperNode then checks the cache file’s metadata to determine whether the file is intact: if it is, the cache is fully hit; if not, the remaining part of the file is downloaded in segments by resuming the download (this requires the source station to support segmented downloads; otherwise, the entire file must be synchronized). A 200 response code means the source file has been modified and the cache is invalid, so back-to-source synchronization is required. Any other response code indicates that the source station is abnormal or the address is invalid, and the download task fails directly.
The CDN cache technology solves the problems of back-to-source downloads and nearby downloads. However, if the cache is not hit, the SuperNode’s back-to-source synchronization is slow enough to directly drag down overall distribution efficiency. To address this issue, Dragonfly uses an automatic tier-based preheating mechanism to maximize the cache hit rate, which works as follows:
When an image is pushed to the Registry, each pushed layer immediately triggers the SuperNode to synchronize that layer locally in a P2P way. This makes full use of the interval (about 10 minutes) between a user’s Push and Pull operations to synchronize all layers of the image to the SuperNode. When the user then runs the Pull command, the cached files on the SuperNode can be used directly, so the long-distance transmission problem does not arise at all.
3) Bandwidth Cost Reduction
By using dynamic compression, Dragonfly applies compression policies to the parts of a file most worth compressing, without affecting the normal operation of the SuperNode and the peers. It therefore saves a lot of network bandwidth and further improves distribution speed. Compared with native HTTP compression, dynamic compression has the following advantages:
First, the dynamic nature ensures that compression is enabled only when the SuperNode and peer loads are normal. At the same time, only the most valuable parts of a file are compressed, and the compression strategy is determined dynamically. In addition, compression speed is greatly increased through multi-threaded compression. Thanks to the SuperNode’s caching capability, compression is performed only once during the whole download process, so the compression gain is at least ten times that of native HTTP compression.
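The load-gated, value-gated decision can be sketched as a small rule. The thresholds, names, and use of zlib here are all assumptions made for the example; Dragonfly’s real policy and codec are not specified in this article.

```python
# Illustrative (hypothetical) decision rule for dynamic compression: compress
# only when SuperNode and peer load are normal, and only blocks judged worth
# compressing. Thresholds and the zlib codec are invented for the example.

import zlib

LOAD_THRESHOLD = 0.8   # skip compression when either side is loaded
MIN_RATIO_GAIN = 0.2   # compress only if we expect >= 20% savings

def maybe_compress(block, supernode_load, peer_load):
    if supernode_load > LOAD_THRESHOLD or peer_load > LOAD_THRESHOLD:
        return block, False         # loads abnormal: send as-is
    compressed = zlib.compress(block)
    if len(compressed) <= len(block) * (1 - MIN_RATIO_GAIN):
        return compressed, True     # worth it: compress once, reuse via cache
    return block, False             # not worth the CPU: send as-is


highly_compressible = b"a" * 1000
payload, used = maybe_compress(highly_compressible, 0.3, 0.4)
print(used, len(payload) < 1000)   # True True
payload, used = maybe_compress(highly_compressible, 0.9, 0.4)
print(used)                        # False (SuperNode overloaded)
```

Because the SuperNode caches the compressed result, the cost of `zlib.compress` is paid once per file rather than once per peer, which is where the claimed gain over per-connection HTTP compression comes from.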
Besides dynamic compression, the SuperNode’s powerful task scheduling capability allows peers under the same network device to exchange blocks as much as possible, so as to reduce the traffic between network devices and data centers. This further reduces network bandwidth costs.
4) Secure Transmission
Transmission must be secured when downloading sensitive files, such as key files or account data. To this end, Dragonfly provides the following functions:
1) HTTP header transmission is supported to satisfy download requests that require permission verification through the header.
2) A self-developed data storage protocol is used to pack and transmit data blocks. In addition, re-encryption will be performed on the packaged data later.
3) Encryption plug-ins will be supported.
4) Multiple validation mechanisms are adopted to prevent data from being tampered with.
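Point 4 above, validation against tampering, is typically implemented by shipping a digest alongside each block and recomputing it on receipt. A minimal sketch, with function names invented for the example (the article does not describe Dragonfly’s actual validation scheme):

```python
# Hypothetical sketch of tamper-prevention validation: each packed block
# carries a SHA-256 digest, and the receiver recomputes it before accepting.

import hashlib

def pack_block(data):
    return {"digest": hashlib.sha256(data).hexdigest(), "data": data}

def validate_block(packed):
    return hashlib.sha256(packed["data"]).hexdigest() == packed["digest"]


block = pack_block(b"seed-block-bytes")
print(validate_block(block))   # True: untampered block accepted

block["data"] = b"tampered!"
print(validate_block(block))   # False: tampering detected and rejected
```

A digest check of this kind detects corruption and tampering in transit; it does not by itself provide confidentiality, which is why the encryption features in points 2 and 3 are listed separately.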
How Was Dragonfly Promoted?
Dragonfly’s entrance into the CNCF incubation stage indicates that the project has highlights that appeal to the TOC. Li Xiang said that Dragonfly solves container image distribution in large-scale scenarios, which is quite different from traditional solutions.
Traditional solutions use centralized storage and distribution. This approach is simple to implement and convenient to manage and control. However, it runs into challenges in large-scale scenarios, since it is difficult to scale out horizontally or to handle bursty traffic flexibly.
For example, in internal Alibaba scenarios or in container service scenarios, especially for some batch computing businesses, there can be a throughput of 1,000 containers created per minute, which puts corresponding pressure on image distribution. To handle such massive traffic, the best approach is to distribute images in a P2P manner. Dragonfly is a system built on this concept to help users and enterprises cope with large-scale container scenarios, so that the container ecosystem can cover more, and more complex, scenarios. Dragonfly’s concept is relatively advanced in the container field: it is both a first attempt and a successful exploration in practice.
While talking about how Dragonfly was promoted from the sandbox stage to the incubation stage, Li Xiang introduced the internal criteria for an evaluation in CNCF. First, CNCF has some basic requirements for incubation projects, such as project maturity, popularity, and distribution of contributors. Dragonfly perfectly meets these incubation requirements.
On the other hand, CNCF also considers whether the project can help cloud-native technologies and the community develop and whether it helps CNCF develop as a foundation. This part of the evaluation is relatively subjective, so it is necessary for the 11 members from TOC to vote on it. According to the vote, most members recognized Dragonfly’s value to the cloud-native field and to the foundation, and therefore Dragonfly was promoted to an incubation project.
In fact, during the sandbox stage, Dragonfly had already demonstrated its value in some practical production environments. It can be applied in various industry scenarios such as e-commerce, telecommunications, finance, and the Internet. It is also used by various customers, including Alibaba, China Mobile, Shopee, Bilibili, Ant Financial, Huya, Didi, and iFlytek.
For example, the Zhejiang branch of China Mobile has run Dragonfly in its production environment for more than three years on more than 1,000 physical machines; currently, more than 200 business systems and 1,700 application modules run on Dragonfly. Shopee, a Singaporean e-commerce platform, has run Dragonfly in production for more than a year on more than 10,000 physical machines. Additionally, Bilibili, a Chinese video site known for its bullet-screen comments, has adopted Dragonfly on more than 3,900 machines across testing and production environments. Furthermore, engineers from Bilibili work with the Dragonfly community and contribute on registry verification and stability.
Needless to say, Li Xiang, as a TOC member and a senior engineer at Alibaba Cloud, has provided many technical and ecosystem suggestions to the project team during Dragonfly’s promotion: for example, how Dragonfly should connect with the Harbor ecosystem and interact with Alibaba Cloud products, and how to better popularize Dragonfly among end users.
What Does the Incubation Stage Mean?
We noticed that Li Xiang is also the author of etcd, another CNCF open-source project in the incubation stage. So what does entering the incubation stage mean for project maintainers?
“This means a greater responsibility for serving cloud-native users and developing the ecosystem. The follow-up work for Dragonfly also focuses on these two goals. To serve cloud-native users, we need to simplify the installation and upgrade processes and improve basic capabilities such as usability and security, so that users can easily use Dragonfly in enterprise-level scenarios. To develop the ecosystem, we will promote the harmonious development of Dragonfly and the CNCF ecosystem, improve its integration capability, and cooperate better with projects like Harbor, Quay, and Clair. We will also promote distribution-related standardization within OCI,” said Li Xiang.
On the other hand, for CNCF as an organization, the promotion of new projects from the sandbox stage to the incubation stage also means that the cloud-native territory has been further expanded. Li Xiang revealed that projects in the sandbox stage are not formal CNCF projects, so CNCF does not invest many resources, such as operations, marketing, or technical guidance, in them. Incubation, by contrast, is the first stage of a formal CNCF project. Hence, the foundation will provide more resources to assist and support Dragonfly, along with more technical guidance, to help Dragonfly graduate successfully and become an important part of the cloud-native field.
Speaking of the last stage of a CNCF project (graduation), Li Xiang said that the main considerations for graduation are the overall ecological health of the project and the maturity of the project. CNCF hopes that the graduated projects can be put into production and meet the requirements of most enterprises.
Prospects of Cloud Native
Nowadays, global cloud-native construction is in full swing, and many first-tier companies are actively embracing cloud-native technologies. Li Xiang said that cloud-native technologies are developing in two directions:
- Standardization: As more and more cloud-native technologies emerge, it becomes difficult to describe, manage, and connect them uniformly, which makes standardization necessary.
- Floating Up to the Application Layer: Cloud-native started at the infrastructure layer and is gradually moving towards the application layer that is closer to users. Eventually, it will realize the vision that software is naturally born and grows on the cloud. Currently, the main obstacle is that the connection between the infrastructure layer and the application layer has not been fully established: the technologies the connection requires are scattered, so the value of cloud-native technologies is not fully demonstrated to users at that closer layer. At present, Alibaba is actively participating in building this connection by helping the CNCF Landscape complete application definition and delivery, on which CNCF SIG App Delivery is also working.
Currently, the Kubernetes orchestration system is the core of the cloud-native ecosystem construction. There is a saying that Kubernetes is the Linux of the cloud-native era. At the same time, another saying seems to be more general: Cloud-native is the cornerstone of open source. Li Xiang expressed his understanding of these two sentences: “The success of Kubernetes is actually attributed to the standardized abstraction of cloud infrastructures such as computing, network, and storage. This is similar to Linux, which is a standard operating system that shields the underlying hardware details. It is the standardized infrastructure abstraction that allows the cloud computing ecosystem to define more application capabilities layer by layer on the abstraction, and efficiently realizes the core value of connecting applications with the cloud.”
At the end of the interview, Li Xiang also offered some suggestions to developers who want to learn cloud-native technologies. At present, the development of cloud-native technologies abroad mainly focuses on resource infrastructure management, application infrastructure (such as service mesh and observability), and application O&M and delivery technologies. In China, by comparison, the focus has mainly been on infrastructure management, though it has recently been shifting toward the developer-oriented application layer. For young developers, the CNCF official community and blogs are good channels for learning the rudiments of cloud-native technologies.
About Li Xiang
Li Xiang holds a bachelor’s degree from Zhejiang University and a master’s degree from Carnegie Mellon University, and is one of the founders of CoreOS. He has also participated in creating several open-source projects, including etcd, the Operator Framework, and rkt. In open-source communities, Li Xiang is best known to developers as the author of etcd. The etcd project has attracted more than 400 contributors, received more than 14,000 commits, and released more than 150 versions. In January 2019, Li Xiang became the first Chinese TOC member in CNCF.
After joining Alibaba Cloud, Li Xiang became primarily responsible for Alibaba Cloud’s large-scale cluster scheduling and management system. He helped complete the preliminary transformation of the infrastructure using cloud-native technologies, which significantly improved resource utilization and the efficiency of software development and deployment, and supported the evolution of cloud products.
This article is reprinted from Open Source China.