By Chen Xingyu (Yu mu), a technical expert on Alibaba's Fundamental Technology Mid-end team.
Alibaba has been using etcd internally to store key metadata for three years now. During the 2019 Double 11 shopping event, etcd played a critical role just as it had in previous years, and it was put to the test under extreme pressure. Beyond our internal use, this year we also shared some of our optimized code and practices with the community, to positive feedback.
In this article, we will look at how we use etcd at Alibaba and how we made it faster, more stable, and more efficient. At the end, we will also touch on what a more intelligent etcd might look like. We hope this article helps more people understand etcd and enjoy the benefits of cloud native technology.
etcd Performance Optimizations
This first section focuses on the optimizations we made to make etcd perform faster and better. Before we begin, let's take a look at how etcd works and where it can encounter performance bottlenecks.
etcd is usually composed of the layers that are shown in the following figure:
Each layer has its own performance bottlenecks, explained below:
- Raft: Raft is the protocol that a cluster of nodes uses to maintain a replicated state machine. Its performance depends on factors such as network I/O and the round-trip time (RTT) among nodes. In addition, write-ahead logging (WAL) depends on disk I/O speed.
- Storage: The storage layer is responsible for persistent key-value storage. Its performance depends on disk I/O, such as fdatasync latency, lock contention on the in-memory treeIndex, lock contention on boltdb transactions (Tx), and the performance of boltdb itself.
- Other layers: These are affected by other factors, including host kernel parameters and gRPC API behavior.
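Because WAL durability hinges on how fast the disk can complete a sync, a quick way to evaluate a candidate disk is to measure sync latency directly. Below is a rough, illustrative probe in Go (not etcd code; it uses `File.Sync`, i.e. fsync, as a stand-in for etcd's fdatasync-based WAL writes):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// probeFsync appends small records to a temp file and syncs after each
// write, the way the WAL must, and returns the worst observed latency.
func probeFsync(rounds int) (max time.Duration, err error) {
	f, err := os.CreateTemp("", "wal-probe-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	record := make([]byte, 512) // roughly the size of a small WAL entry
	for i := 0; i < rounds; i++ {
		if _, err := f.Write(record); err != nil {
			return 0, err
		}
		start := time.Now()
		if err := f.Sync(); err != nil { // flush to durable storage
			return 0, err
		}
		if d := time.Since(start); d > max {
			max = d
		}
	}
	return max, nil
}

func main() {
	max, err := probeFsync(100)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Printf("worst fsync latency over 100 rounds: %v\n", max)
}
```

If the worst-case latency regularly climbs into hundreds of milliseconds, the disk is likely too slow for a healthy etcd WAL.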
Now that we are familiar with how etcd works, let's look at how we optimized it for better performance. Our optimization work can be divided into two main parts: server-side optimization and client-side optimization. This section deals with the server side.
etcd is demanding in terms of CPU, memory, and disk usage. As the amount of stored data and the volume of concurrent access grow, we need to figure out appropriate hardware specifications for different scenarios. We recommend a dedicated host machine with at least a four-core CPU, 8 GB of RAM, SSD storage, and high-speed, low-latency network connections. At Alibaba, our etcd deployment runs on powerful hardware to support the ultra-large-scale container clusters we host.
etcd is open-source software that brings together the work of outstanding developers around the world. In the last year, many contributors came together to integrate performance optimizations into etcd. Below we introduce some of these optimizations and end this section with an etcd storage optimization contributed by Alibaba.
1. Memory index: etcd's in-memory index relies heavily on locking for synchronization, which has a significant impact on performance. By optimizing how these locks are used, read/write performance was improved. For details, see the related GitHub PR.
2. Large-scale lease usage: Leases are how etcd expires keys when their TTL runs out. In earlier versions, lease scalability was less than ideal, meaning that if leases were used in large numbers, performance would drop significantly. This problem was solved by tweaking the algorithms for lease revocation and lease expiration. You can read more in the related GitHub PR.
3. Backend boltdb optimization: etcd uses boltdb to store key-value data, so optimizing boltdb has a major impact on the overall performance of an etcd deployment.
boltdb can be tuned for different hardware deployments and workloads by adjusting its batch size and batch interval. In addition, a new fully concurrent read feature optimizes boltdb's transaction read/write locking, which greatly improves read/write performance.
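The batch behavior is exposed through etcd server flags. As a sketch (flag names as in recent etcd 3.4+ releases; the values below are illustrative, not recommendations, so check `etcd --help` for your version):

```shell
# Commit the pending backend (boltdb) transaction after this many puts
# or after this interval, whichever comes first. Larger batches amortize
# commit cost on slow disks; smaller ones reduce latency jitter.
etcd --name infra0 \
  --backend-batch-limit 10000 \
  --backend-batch-interval 100ms \
  --quota-backend-bytes 8589934592   # raise the default backend quota (8 GB here)
```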
To end the section on software optimization, I’d like to present an algorithm developed by Alibaba, specifically a freelist allocation and recovery algorithm based on segregated hashmap.
The following figure shows the architecture of an etcd node. etcd uses boltdb to persistently store all key-value data, so the performance of boltdb plays an essential role in the overall performance of the node.
Alibaba uses etcd to store metadata on a large scale. While using etcd, we discovered a performance issue with boltdb. Here are the details:
The preceding figure shows how etcd allocates and recovers storage space. By default, etcd stores data in 4 KB pages. The numbers in the figure indicate page IDs: red pages are in use, white pages are not. When a user deletes data, etcd does not return the storage space to the system. Instead, it maintains an internal page pool, called the freelist, to improve reuse performance. When etcd needs to store new data, it normally performs a linear scan of the freelist, which has a time complexity of O(n). When the data volume is large or internal fragmentation is severe, performance drops sharply.
Therefore, we redesigned and implemented a new freelist allocation and recovery algorithm based on a segregated hashmap. The new algorithm reduces the time complexity of allocation from O(n) to O(1) and that of recovery to O(1), a quantitative leap that greatly improves etcd's ability to store data. It increases etcd's storage capacity by a factor of 50, from the recommended 2 GB to 100 GB, and improves read/write performance by a factor of 24. The official Cloud Native Computing Foundation blog wrote about this optimization; if you're interested, you can check it out.
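To make the O(n)-vs-O(1) contrast concrete, here is a heavily simplified Go sketch of the two approaches. This is an illustration of the idea, not boltdb's actual implementation: free pages are modeled as runs of consecutive page IDs, and the "segregated" structure is reduced to a single size-indexed map.

```go
package main

import "fmt"

// A run of free pages: `size` consecutive pages starting at page `start`.
type run struct{ start, size int }

// linearAlloc is the classic freelist: scan all free runs until one is
// large enough -- O(n) in the number of free runs.
func linearAlloc(free []run, n int) (int, bool) {
	for i, r := range free {
		if r.size >= n {
			free[i].start += n
			free[i].size -= n
			return r.start, true
		}
	}
	return 0, false
}

// sizeIndex sketches the segregated-hashmap approach: free runs are
// bucketed by size, so allocating n pages is a single map lookup --
// O(1) on the happy path.
type sizeIndex map[int][]int // run size -> start page IDs of runs with that size

func (idx sizeIndex) alloc(n int) (int, bool) {
	starts := idx[n]
	if len(starts) == 0 {
		return 0, false // a real implementation would fall back to larger buckets
	}
	start := starts[len(starts)-1]
	idx[n] = starts[:len(starts)-1]
	return start, true
}

func main() {
	free := []run{{0, 2}, {10, 8}}
	p, _ := linearAlloc(free, 4)
	fmt.Println("linear scan allocated at page", p) // prints "linear scan allocated at page 10"

	idx := sizeIndex{4: {20}, 8: {40}}
	p, _ = idx.alloc(4)
	fmt.Println("size-indexed alloc at page", p) // prints "size-indexed alloc at page 20"
}
```

The real algorithm also has to handle merging adjacent runs on recovery and falling back across size buckets, which is where the segregated design earns its keep.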
etcd performance optimization wouldn't be complete without some work on the client side. Following the best practices below will help keep your etcd clusters running efficiently and stably.
1. Avoid large values in put operations, as they have a major impact on etcd performance. In Kubernetes, for example, pay attention to how CRDs are used.
2. Avoid keys and values that change constantly, as happens when node status data is updated in Kubernetes.
3. Avoid creating a large number of lease objects. Instead, try to reuse leases that expire at similar times, as Kubernetes does when managing event data.
More Efficient etcd Management
As a distributed key-value database based on the Raft protocol, etcd is a stateful application. Tasks such as managing etcd clusters, maintaining etcd nodes, performing cold and hot backups, and recovering from faults are rather complex and require a certain degree of expertise in etcd internals. Running etcd clusters efficiently is quite challenging.
There are tools out there, such as etcd operator, that can help you create, configure, and manage etcd clusters on Kubernetes. However, these tools can be hit or miss due to poor versatility, poor integration, a steep learning curve, and, most importantly, instability.
To address these problems, we have developed Alpha, an etcd management and maintenance platform based on etcd operator to suit Alibaba’s use cases. Alpha can help users manage and maintain etcd clusters efficiently and perform tasks that would otherwise require multiple tools. This means a single person can manage and maintain thousands of etcd clusters.
The following figure shows the basic functions of Alpha:
As shown in the preceding figure, Alpha is capable of etcd lifecycle management and data management.
Lifecycle management relies on the declarative CustomResource definitions provided by the operator. The process of creating and destroying etcd clusters is streamlined and transparent, and you do not need to configure each etcd member separately: simply specify a few fields, such as the number of members, the member version, and performance parameters, and you're ready to go. There are also helpful features such as etcd version upgrades, faulty node replacement, and starting or stopping cluster instances. These automate common etcd O&M operations and improve stability when changes are made to etcd clusters.
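Alpha itself is internal to Alibaba, but the declarative style it builds on is the same as the open-source etcd operator's. As a sketch, an operator-managed cluster is described by a short custom resource like this (following etcd-operator's published conventions; your operator's fields may differ):

```yaml
apiVersion: etcd.database.coreos.com/v1beta2   # etcd-operator's API group
kind: EtcdCluster
metadata:
  name: example-etcd
spec:
  size: 3            # number of members; the operator reconciles toward this
  version: "3.4.3"   # member version; changing it triggers a rolling upgrade
```

Applying this manifest is the entire "create a cluster" workflow; the operator handles member bootstrap, peer discovery, and replacement of failed members.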
Of course, we haven’t forgotten about data, the most valuable part of any etcd deployment. Alpha supports scheduled cold backup and real-time hot backup. Backup copies can be hosted on local disks and cloud-based OSS. You can also quickly restore etcd clusters from backups. Aside from backup, Alpha supports data scanning and analysis as well. It can detect the number of hot keys and storage capacity, providing the basis for etcd multi-tenant support. Last but not least, Alpha also supports junk data cleanup and cross-cluster data transfer.
These functions provide a lot of flexibility in Kubernetes cluster management. For example, if a user has an etcd cluster on a private Kubernetes cluster or in the cloud and wants to move to another vendor, they can use Alpha to migrate ledger data and other critical data.
With Alpha, we make managing and maintaining etcd transparent, automated, and web-based, reducing the required manpower and increasing efficiency.
More Stable etcd
Now let's talk about how to make etcd more stable. Cloud container platforms are highly dependent on etcd, and its service quality and stability can make or break a cloud platform; its importance cannot be overstated. Let's look at some common issues and risks users are likely to encounter when using etcd. As shown in the following figure, these can be divided into three main categories:
- etcd: OOM, bugs, panic, and so on.
- Host: hardware faults, network faults, interference from other processes on the same host.
- Client: bugs, mistakes by personnel, DDoS, and so on.
To address these risks, we can start by:
- Establishing a comprehensive monitoring and alert mechanism that covers client input, etcd, and host environment status
- Completing Client operation audits, so that high-risk operations such as data deletion can be controlled and throttled
- Performing data governance, to analyze client misuse and promote best practices
- Performing scheduled cold backup and multi-site redundancy through hot backup to ensure data security
- Normalizing fault drills and preparing fault recovery plans
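For the monitoring point above, etcd already exports the relevant signals as Prometheus metrics. As an illustration (the metric names are etcd's real exported metrics; the thresholds and durations below are example values, not official recommendations):

```yaml
groups:
- name: etcd-health
  rules:
  - alert: EtcdNoLeader
    # a member reporting no leader cannot serve linearizable requests
    expr: etcd_server_has_leader == 0
    for: 1m
    labels: {severity: critical}
    annotations:
      summary: "etcd member {{ $labels.instance }} has no leader"
  - alert: EtcdSlowWALFsync
    # sustained high p99 WAL fsync latency usually means disk trouble
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "etcd WAL fsync p99 latency is high on {{ $labels.instance }}"
```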
Summary and Looking Ahead
In this article, we have discussed some of the ways we made etcd faster, more stable, and more efficient. In the future, we will also work on making etcd more intelligent; here we can only give a brief and simplified outline. At the end of the day, a more intelligent etcd means more intelligent management and less human intervention: for example, the system should be able to handle some issues on its own.