Getting Started with Kubernetes | etcd Performance Optimization Practices
By Chen Xingyu (Yumu), Technical Expert at Alibaba Cloud on Basic Technology Mid-Ends
etcd stores key metadata on a container cloud platform. Alibaba has been using etcd for three years, and it played a critical role during the 2019 Double 11 Global Shopping Festival. This article introduces our best practices for optimizing etcd server performance and for using the etcd client, to help you run etcd clusters stably and efficiently.
1. etcd Overview
etcd was developed by CoreOS in Go. It is a distributed key-value storage engine that can serve as a database for the metadata of a distributed system, and it is widely used by major companies.
The following figure shows the basic architecture of etcd.
A cluster has three nodes: one leader and two followers. Each node synchronizes data by using the Raft algorithm and stores data in BoltDB. When one node fails, other nodes automatically elect a new leader to maintain the high availability of the cluster. The etcd client can complete a request by connecting to any node.
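The fault tolerance described above follows from Raft's majority rule: a cluster keeps working as long as more than half of its members are reachable. The arithmetic can be sketched as follows (the function names are illustrative and are not part of etcd):

```python
def quorum(members):
    """Smallest number of nodes that forms a majority."""
    return members // 2 + 1


def fault_tolerance(members):
    """Number of nodes that can fail while a majority remains."""
    return members - quorum(members)


for n in (1, 3, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

For the three-node cluster in the figure, the quorum is two nodes, so the cluster survives the loss of one node.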
2. etcd Performance
The preceding figure shows a standard etcd cluster architecture. An etcd cluster can be divided into the Raft layer (blue) and the storage layer (red). The storage layer is further divided into the treeIndex layer and the BoltDB layer for persistent key-value storage. Each of these layers can become a performance bottleneck.
The Raft layer synchronizes data over the network, so etcd performance may be affected by the round-trip time (RTT) and bandwidth between nodes. Writes to the write-ahead log (WAL) are also limited by disk write speed.
At the storage layer, etcd performance may be affected by the fdatasync latency of disk I/O and by lock contention in the treeIndex layer. It may also be greatly affected by the BoltDB Tx lock and by the performance of BoltDB itself.
In addition, etcd performance is affected by the kernel parameters of the etcd host and by the latency of the gRPC API layer.
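To get a rough feel for the fdatasync latency mentioned above, you can time synchronous flushes on the disk that would hold the etcd data. The following is a minimal sketch, not a benchmark tool: the function name, the 2 KB write size, and the number of rounds are assumptions chosen for illustration, and the probe falls back to fsync on platforms without fdatasync.

```python
import os
import tempfile
import time


def fdatasync_latencies(directory, rounds=20, size=2048):
    """Append `size` bytes and flush them to disk `rounds` times.

    Returns the latency of each flush in milliseconds.
    """
    sync = getattr(os, "fdatasync", os.fsync)  # fall back where fdatasync is unavailable
    payload = b"\0" * size
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(rounds):
            f.write(payload)
            f.flush()  # push Python's buffer to the OS
            start = time.perf_counter()
            sync(f.fileno())  # ask the OS to persist the data
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


if __name__ == "__main__":
    samples = sorted(fdatasync_latencies(tempfile.gettempdir()))
    print(f"median={samples[len(samples) // 2]:.2f} ms  max={samples[-1]:.2f} ms")
```

Point `directory` at the intended etcd data disk; the system temporary directory used here may be memory-backed and would then report unrealistically low values.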
3. etcd Server Performance Optimization
The following shows how to optimize the etcd server performance.
Hardware Deployment
The etcd server requires sufficient CPU and memory resources to keep etcd running. etcd is a disk I/O-dependent database program that requires solid state disks (SSDs) with low I/O latency and high throughput. etcd is also a distributed key-value storage system that requires good network conditions to run properly. Therefore, deploy etcd independently from other programs running on the host to prevent their impact on etcd performance.
For more information, see the official etcd documentation on recommended hardware configurations.
Software
The etcd software is divided into several layers. The following shows how etcd performance has been optimized at these layers. The related code is available in the corresponding GitHub pull requests.
- Optimize the memory index layer of etcd. Specifically, optimize the internal lock usage to reduce wait time. The original implementation held a coarse-grained lock while traversing the internal B-tree, which greatly affected etcd performance. The optimized locking reduces this impact and lowers latency.
- Optimize large-scale lease usage. Specifically, optimize the lease revoke routine and the expiration check algorithm to reduce the time complexity of checking for expired leases from O(n) to O(log n). This resolves the performance problems caused by a large number of leases.
- Optimize the backend BoltDB. Specifically, make the backend batch size limit and batch interval configurable so that they can be tuned for different hardware and workloads. These parameters were previously fixed at conservative values.
- Optimize fully concurrent read. Specifically, optimize the calling of the BoltDB Tx read and write locks to improve the read performance. This optimization was made by a Google engineer.
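To illustrate the lease optimization in the list above, the following sketch shows one common way to replace an O(n) scan over all leases with O(log n) operations: a min-heap ordered by expiry time. This is an assumption made for illustration and is not etcd's actual code; the class and method names are hypothetical.

```python
import heapq


class LeaseExpiryQueue:
    """Min-heap of (expiry_time, lease_id) pairs. Illustrative only."""

    def __init__(self):
        self._heap = []
        self._expiry = {}  # lease_id -> current expiry time

    def grant_or_refresh(self, lease_id, expiry_time):
        self._expiry[lease_id] = expiry_time
        heapq.heappush(self._heap, (expiry_time, lease_id))  # O(log n)

    def revoke(self, lease_id):
        # The heap entry stays behind and is skipped lazily in pop_expired.
        self._expiry.pop(lease_id, None)

    def pop_expired(self, now):
        """Return the IDs of all leases whose expiry time is <= now."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expiry_time, lease_id = heapq.heappop(self._heap)  # O(log n)
            # Ignore entries made stale by a refresh or a revoke.
            if self._expiry.get(lease_id) == expiry_time:
                del self._expiry[lease_id]
                expired.append(lease_id)
        return expired
```

Each expiration check inspects only the top of the heap, so the cost depends on the number of leases that have actually expired rather than on the total number of leases.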
New Algorithm for Allocating and Reclaiming etcd Internal Storage in the Freelist Based on the Segregated Hashmap
The following introduces a performance optimization made by Alibaba. This performance optimization significantly improves the internal storage performance of etcd through a new algorithm for allocating and reclaiming etcd internal storage in the freelist based on the segregated hashmap.
The preceding figure shows a single-node etcd architecture, in which BoltDB persistently stores all key-value data. The BoltDB performance is essential for the overall performance of etcd. A large amount of Alibaba metadata is stored in etcd. This exposes some of etcd’s performance problems.
The preceding figure shows a core algorithm for allocating and reclaiming the internal storage of etcd. By default, etcd uses 4 KB pages to store data. As shown in the figure, the numbers indicate the page IDs. Pages in red are being used, whereas pages in white are not in use.
When data is deleted, etcd does not immediately return the storage space to the system, but keeps it in a page pool so that the space can be reused more efficiently. This page pool is called the freelist. As shown in the figure, pages 43, 45, 46, 50, and 53 are in use, and the freelist holds the unused pages 42, 44, 47, 48, 49, 51, and 52.
When new data needs three consecutive pages, the old algorithm scans from the head of the freelist and returns the start page ID 47. This linear scan performs poorly when the freelist is large or heavily fragmented.
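The linear scan can be sketched as follows. This is a simplified illustration of the idea rather than the BoltDB code: the function name is hypothetical, and the sketch only finds the run without removing it from the freelist.

```python
def allocate_linear(free_ids, n):
    """Return the start page ID of the first run of n consecutive free pages.

    free_ids must be sorted in ascending order. Returns None if no run is
    long enough. The cost is O(len(free_ids)).
    """
    run_start = None
    prev = None
    for pid in free_ids:
        if prev is None or pid != prev + 1:
            run_start = pid  # a new run begins here
        if pid - run_start + 1 == n:
            return run_start
        prev = pid
    return None


free_ids = [42, 44, 47, 48, 49, 51, 52]
print(allocate_linear(free_ids, 3))  # 47
```

With the freelist from the figure, the scan has to walk past pages 42 and 44 before it finds the run 47, 48, 49. The more fragmented the freelist, the longer this walk becomes.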
To solve this problem, we designed and implemented a new freelist allocation and reclamation algorithm based on a segregated hashmap. The hashmap key is the size of a run of consecutive free pages, and the value is the set of start page IDs of all free runs with that size. When data needs to be stored on new pages, a single hashmap lookup with time complexity O(1) returns a start page ID.
For the same request of three consecutive pages, looking up the key 3 in the hashmap immediately returns the start page ID 47.
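A minimal sketch of this lookup is shown below. The function names are hypothetical, and the sketch covers only the exact-size case; what happens when no free run of exactly the requested size exists is left out.

```python
from collections import defaultdict


def build_size_map(spans):
    """Group free spans by size: {span_size: {start_page_id, ...}}."""
    by_size = defaultdict(set)
    for start, size in spans:
        by_size[size].add(start)
    return by_size


def allocate(by_size, n):
    """Take a free span of exactly n pages with one hashmap lookup, or return None."""
    starts = by_size.get(n)
    if not starts:
        return None
    start = starts.pop()
    if not starts:
        del by_size[n]  # no spans of this size remain
    return start


# Free spans from the figure as (start_page_id, size).
by_size = build_size_map([(42, 1), (44, 1), (47, 3), (51, 2)])
print(allocate(by_size, 3))  # 47
```

The lookup cost does not depend on how many free pages exist or on how fragmented they are.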
We also optimized the page release process by using hashmaps. For example, when pages 45 and 46 are released, they are merged with the adjacent free pages before and after them to form one larger contiguous run that starts at page 44 and has a size of 6.
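The merge on release can be sketched with two additional hashmaps, one keyed by the start page of each free span and one keyed by its end page. The names `forward` and `backward` are assumptions for this illustration, and updating the size-keyed hashmap is omitted for brevity.

```python
def release(forward, backward, start, size):
    """Free the span [start, start + size) and merge it with adjacent free spans.

    forward maps start page -> span size; backward maps end page -> span size.
    Every step is a dictionary operation, so the cost does not depend on the
    number of free spans.
    """
    end = start + size - 1
    # Merge with the free span that ends right before this one, if any.
    prev_size = backward.pop(start - 1, None)
    if prev_size is not None:
        start -= prev_size
        del forward[start]
    # Merge with the free span that starts right after this one, if any.
    next_size = forward.pop(end + 1, None)
    if next_size is not None:
        del backward[end + next_size]
        end += next_size
    merged_size = end - start + 1
    forward[start] = merged_size
    backward[end] = merged_size
    return start, merged_size


# Free spans from the figure: 42, 44, 47-49, 51-52.
forward = {42: 1, 44: 1, 47: 3, 51: 2}
backward = {42: 1, 44: 1, 49: 3, 52: 2}
print(release(forward, backward, 45, 2))  # (44, 6)
```

Releasing pages 45 and 46 finds page 44 through `backward` and the run 47 to 49 through `forward`, which yields the merged run of six pages starting at page 44.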
The new algorithm reduces the time complexity of allocation from O(n) to O(1) and that of reclamation from O(n log n) to O(1). As a result, the internal storage no longer limits the read and write performance of etcd, which improves by dozens of times. The recommended storage size for a single cluster increases from 2 GB to 100 GB. This optimization is currently used within Alibaba and is available to the open-source community.
These software optimizations are all available in the new etcd version.
4. etcd Client Performance Optimization
The following introduces the best practices for ensuring optimal etcd client performance.
The etcd server provides the following APIs to the etcd clients: Put, Get, Watch, Transactions, and Leases.
We use the following best practices when calling these APIs on the etcd client:
- Avoid large values when calling the Put API, and keep each value as compact as possible. For example, pay attention to object size when you use CustomResourceDefinitions (CRDs) in Kubernetes.
- etcd is suited to storing key-value metadata that changes infrequently. Avoid creating frequently changed key-value data on the etcd client. The newer node heartbeat reporting mechanism in Kubernetes follows this practice.
- Avoid creating many leases. Reuse leases whenever possible. Event data management in Kubernetes follows this practice: events with the same time to live (TTL) are attached to an existing lease, rather than creating a new lease for each event.
Observe the preceding best practices when using the etcd client to ensure that your etcd cluster runs stably and efficiently.
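The lease reuse practice above can be sketched as a small pool that hands out one lease per TTL bucket. This is an illustration under stated assumptions, not the Kubernetes implementation: `LeasePool`, `lease_for`, and the 60-second bucket are hypothetical, and `grant` stands in for whatever lease grant call your etcd client provides.

```python
import itertools


class LeasePool:
    """Share one lease among keys whose TTLs fall into the same bucket.

    grant is any callable that takes a TTL in seconds and returns a lease ID.
    In a real client it would wrap the etcd lease grant call.
    """

    def __init__(self, grant, bucket_seconds=60):
        self._grant = grant
        self._bucket = bucket_seconds
        self._leases = {}  # rounded TTL -> lease ID

    def lease_for(self, ttl):
        # Round the TTL up to the next bucket boundary.
        rounded = -(-ttl // self._bucket) * self._bucket
        if rounded not in self._leases:
            self._leases[rounded] = self._grant(rounded)
        return self._leases[rounded]


# Stand-in for the server: each grant returns the next lease ID.
counter = itertools.count(1)
pool = LeasePool(grant=lambda ttl: next(counter))
ids = [pool.lease_for(ttl) for ttl in (3600, 3590, 3600, 120)]
print(ids)  # [1, 1, 1, 2]
```

Four keys end up on two leases instead of four. Keys attached to a shared lease expire together, and this sketch ignores the time already elapsed on an existing lease, so a production version would need to rotate leases periodically.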
Summary
Let’s summarize what we have learned in this article.
- The potential performance bottlenecks that affect etcd
- How to optimize etcd server performance in terms of hardware, deployment, and internal core software algorithms
- The best practices for using the etcd client
I hope that this article can help you run your etcd cluster stably and efficiently.