An In-depth Discussion on the LSM Compaction Mechanism

Introduction

Compaction is critical to systems built on the Log-Structured Merge Tree (LSM-Tree) architecture. Because writes are appended in log fashion and data is constantly flushed from memory to disk once memory fills up, key ranges increasingly overlap across files and multiple versions of the same key accumulate. This degrades read performance and inflates space usage. The compaction mechanism was introduced to address both problems: periodic background tasks continuously reclaim old data and merge multiple sorted runs into one. However, the compaction policy and its task scheduling become new problems in turn; these seemingly simple functions must balance many resources, including space, I/O, CPU, and cache. This article discusses compaction policies, their challenges, the implementations in mainstream LSM systems, and academic research directions.

Compaction Policies

Compaction performs garbage collection (GC) and merge-sorts data, both of which are necessary for an LSM-Tree system to function. However, compaction tasks carry a lot of overhead: CPU consumed by compression, decompression, copying, and comparison, and disk I/O for data reads and writes. Compaction policies are therefore needed to constrain the shape of the LSM-Tree and to determine which files to merge, how large each task is, and when a task is triggered. Different policies have different impacts on read amplification, write amplification, space amplification, and the size of the temporary space required. Systems generally support several policies with multiple adjustable parameters, to be selected according to the application scenario.

Basic Concepts

Read, Write, and Space Amplification

  • Read amplification: the number of disk reads a single logical read incurs. A point lookup may have to probe a run on every level, and a range query must merge data from all overlapping runs.
  • Write amplification: the ratio of bytes written to disk to bytes written by the user. Compaction rewrites data, so an entry may be rewritten once per level, or more, depending on the policy.
  • Space amplification: the ratio of space occupied on disk to the size of the live data, caused by obsolete versions and delete markers that have not yet been reclaimed.

Date-Tiered Compaction Assumptions

The date-tiered policy targets time-series workloads and rests on the following assumptions (Reference 1):

  • Data is written chronologically, and only a small amount of data violates this order.
  • Data is deleted only through a time-to-live (TTL) index or by removing an entire partition.
  • The write rate is almost constant.
  • Queries usually fall into a specific time range, such as “values from the last hour, last day, or last week”.

Space Amplification of the Tiered Policy

The following analysis considers the case where the same batch of data is repeatedly updated or deleted. Assume the size ratio of adjacent levels is T, and that merging of an entire level is triggered when the number of runs in the level reaches T. Space amplification is then O(T), and in practice temporary space must additionally be reserved to accommodate merges in flight. In the Scylla test (Reference 1), with T set to 4, the peak space usage for 1.2 GB of data reached 9.3 GB, which shows how rigorous the tiered policy's space requirement is.
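
As a rough back-of-envelope check of those numbers (my own arithmetic, not taken from the Scylla post): under a pure overwrite workload the largest level can accumulate up to T runs, each close to a full copy of the live data set, and the smaller levels plus the in-flight merge output can add roughly as much again before the input runs are deleted.

```latex
% T: runs per level (size ratio), |D|: size of the live data set.
S_{\mathrm{peak}} \;\lesssim\;
  \underbrace{T\,|D|}_{\text{runs in the largest level}} +
  \underbrace{T\,|D|}_{\text{smaller levels + merge output}}
  \;\approx\; 2T\,|D|
% With T = 4 and |D| = 1.2 GB, the bound allows peaks near 9.6 GB,
% consistent with the observed 9.3 GB.
```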

Key-value Stores in the Industry

Every system builds its policy and scheduling mechanism around “how and when”: how the policy selects files, and when that selection is triggered.

RocksDB

RocksDB supports several compaction styles and has made many optimizations around them. In addition to the classic tiered and leveled policies (tiered compaction is called “universal” compaction in RocksDB), it offers two hybrid modes: leveled-N and tiered+leveled.
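
A minimal sketch of selecting a style through the RocksDB C++ API (the option names below are real RocksDB options; the values are purely illustrative, not tuning advice):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Leveled compaction (the default): lower space and read
  // amplification at the cost of higher write amplification.
  options.compaction_style = rocksdb::kCompactionStyleLevel;
  options.level0_file_num_compaction_trigger = 4;  // L0 runs before a merge
  options.max_bytes_for_level_base = 256 << 20;    // target size of L1
  options.max_bytes_for_level_multiplier = 10;     // adjacent-level size ratio

  // Tiered compaction is "universal" in RocksDB: lower write
  // amplification, but space amplification must be watched.
  // options.compaction_style = rocksdb::kCompactionStyleUniversal;
  // options.compaction_options_universal.max_size_amplification_percent = 200;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/lsm-demo", &db);
  assert(s.ok());
  delete db;
}
```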

Scylla

Scylla supports the tiered and leveled compaction policies. The default is the size-tiered compaction policy, which targets write-optimized scenarios. Scylla also supports incremental compaction, an optimized variant of size-tiered, and the date-tiered compaction policy for time-series scenarios.

HBase

Compaction in HBase is divided into minor and major. Similar to Bigtable, a minor compaction selects some of the store files and merges them, so the data volume involved is low and the merge frequency is high; multi-version data is not reclaimed, which avoids having to resolve multiple versions of the same key. A major compaction merges all store files into one, so it involves a large amount of data and is usually triggered at long intervals (typically days) or manually. Major compaction additionally handles expired data.
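
To make the selection step concrete, here is a simplified size-ratio heuristic in the spirit of HBase's ratio-based minor-compaction selection (an illustrative sketch, not HBase's actual implementation): a file qualifies only if it is not much larger than all the files newer than it combined, which keeps tasks small and avoids rewriting one huge old file.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// 'sizes' holds store-file sizes ordered from oldest to newest.
// Returns the indices of the files a minor compaction should merge.
std::vector<size_t> SelectForMinorCompaction(
    const std::vector<uint64_t>& sizes,
    double ratio = 1.2, size_t min_files = 3, size_t max_files = 10) {
  uint64_t newer_sum =
      std::accumulate(sizes.begin(), sizes.end(), uint64_t{0});
  size_t start = 0;
  // Skip files that dwarf everything newer than them:
  // advance while size[i] > ratio * sum(size[i+1..]).
  for (; start < sizes.size(); ++start) {
    newer_sum -= sizes[start];
    if (sizes[start] <= ratio * newer_sum) break;
  }
  std::vector<size_t> picked;
  for (size_t i = start; i < sizes.size() && picked.size() < max_files; ++i)
    picked.push_back(i);
  if (picked.size() < min_files) picked.clear();  // too small to be worth it
  return picked;
}
```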

Challenges

Compaction tasks have a significant impact on performance in two respects: the I/O and CPU resources consumed while a compaction runs, and the batch cache invalidation that follows it. In addition, the LSM-Tree structure introduces another tricky problem: deleted entries cannot be removed immediately. The data covered by a delete may exist on every level, so it can only be eliminated by a full compaction, which consumes a lot of resources.

Resource Consumption

When a compaction task runs, operations such as compression, decompression, copying, and comparison consume CPU resources, while its data reads and writes consume I/O resources.

Cache Invalidation

When a compaction task completes, the old data files become invalid, and so do their blocks in the cache. The larger the compaction task and the hotter the data, the more severe the invalidation. It raises the cache miss rate and thereby incurs massive read I/Os, which in turn compete for resources with the read I/Os of other compaction tasks and further degrade read performance.

Delete Entries

In an LSM-Tree, both deletions and updates are expressed by writing a new entry. Unlike an update, a deletion is usually marked with a one-bit flag (a tombstone). The accumulation of delete entries increases space amplification and write amplification, and in range-delete scenarios it can severely hurt read performance. Currently, removing deleted entries requires triggering a full compaction, which is resource-intensive and costly. As an optimization, RocksDB selects files for merging according to the distribution of delete markers, eliminating deletion records while avoiding redundant data merging. However, if the deletions span many files, full compaction is still inevitable.
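
The constraint shows up directly in the merge step. Here is an illustrative sketch (not any particular system's code) of the rule: a tombstone can be dropped only when the compaction output lands on the bottommost level, because otherwise an older version of the key further down would be resurrected.

```cpp
#include <optional>
#include <string>

struct Entry {
  std::string key;
  std::string value;
  bool is_tombstone;  // the one-bit deletion flag
};

// Decide whether a compaction keeps the newest version of a key in its
// output. 'bottommost' means no older data can exist below this level.
std::optional<Entry> MergeKey(const Entry& newest, bool bottommost) {
  if (!newest.is_tombstone) return newest;  // live value: always keep
  if (bottommost) return std::nullopt;      // tombstone has done its job
  // Some lower level may still hold an older version of this key, so
  // the tombstone must be carried along until compaction reaches the
  // bottom -- this is why deletes keep costing space and writes.
  return newest;
}
```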

Academic Research

Compaction Management of Distributed KV Stores

In this paper (Reference 2), the authors propose performing compaction on separate servers (offload compaction) and using incremental cache backfilling to address resource consumption and cache invalidation in distributed KV stores. The implementation and evaluation are based on HBase, which is derived from Bigtable and shares a similar architecture: HBase uses ZooKeeper to maintain consistency, evenly splits data across multiple region servers by key range, and persists its data in HDFS. Each region server maintains memstores to absorb new writes and caches data files from HDFS in a block cache.

Offload Compaction

Two new components are added to the HBase architecture: a compaction manager and compaction servers. Compaction servers can be added or removed dynamically. Like region servers, they read data from HDFS and write the merged result back to HDFS. The compaction manager is similar to the HBase master: it manages the set of compaction servers and the mapping from region servers to compaction servers. Compaction tasks are offloaded from region servers to compaction servers, and the merged data received back from a compaction server is used to preheat the cache.

Remote Cache

After a region server's compaction is offloaded, the merged data is synchronized to the memory of the remote server. A request for this data then either reads from the local disk or from the remote server's memory, which is effectively a choice between disk I/O and network I/O. The test results show no obvious improvement from accessing the remote cache.

Smart Cache

Since accessing the remote cache does not improve performance, the authors propose warming up the local cache with increments from the remote cache. The method replaces only the outdated cache entries in the affected key range, avoiding the eviction of large amounts of unrelated cached data. A request therefore hits either the old cache or the newly replaced entries, and there is little chance that data has been evicted but not yet backfilled. This method works well: the tests show almost no cache misses.
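
A sketch of the incremental backfill idea with hypothetical types (the paper describes the mechanism at the level of key ranges; this is only the general shape): instead of letting compaction invalidate a whole range, the stale cache entries in the affected range are swapped for the merged blocks shipped back from the compaction server.

```cpp
#include <map>
#include <string>

using Key = std::string;
using Block = std::string;  // stand-in for a cached data block

// Local block cache indexed by each block's start key (illustrative).
std::map<Key, Block> block_cache;

// Replace only what the compaction made stale: drop old blocks in
// [lo, hi) and install the merged blocks received from the compaction
// server, so hot data never leaves the cache uncovered.
void BackfillRange(const Key& lo, const Key& hi,
                   const std::map<Key, Block>& merged_blocks) {
  block_cache.erase(block_cache.lower_bound(lo), block_cache.lower_bound(hi));
  for (const auto& [key, block] : merged_blocks)
    block_cache.emplace(key, block);
}
```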

PebblesDB

This paper (Reference 3) proposes an improved LSM-Tree structure called FLSM (Fragmented Log-Structured Merge Tree), designed to reduce write amplification by reducing rewrite operations during compaction, and builds a high-performance KV store named PebblesDB on top of it. FLSM combines ideas from the skip list and the LSM-Tree: each level is partitioned by guard keys, and compaction appends fragmented SSTables under the guards of the next level instead of merge-rewriting the files already stored there. Write amplification stems from exactly those rewrites; it shortens the life span of storage devices with limited erase cycles, such as SSDs, increases storage costs, and reduces write throughput.
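
A minimal sketch of the guard idea with illustrative types (simplified from the paper): compaction splits an incoming run at the guard boundaries of the next level and appends the fragments there, never rewriting the fragments a guard already holds.

```cpp
#include <map>
#include <string>
#include <vector>

using Key = std::string;
struct Fragment { Key min_key, max_key; /* file handle, etc. */ };

// One FLSM level: guard key -> fragments whose keys fall between this
// guard and the next. Fragments inside a guard may overlap each other.
using Level = std::map<Key, std::vector<Fragment>>;

// Append pre-split pieces into the level. Assumes a sentinel guard ""
// exists so that every key has an owning guard.
void AppendToLevel(Level& level, const std::vector<Fragment>& pieces) {
  for (const auto& piece : pieces) {
    auto guard = level.upper_bound(piece.min_key);
    --guard;                         // largest guard key <= min_key
    guard->second.push_back(piece);  // append only, never rewrite: this
                                     // is the write-amp saving; reads
                                     // must merge all fragments in the
                                     // guard (the deliberate trade-off)
  }
}
```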

Dostoevsky

This paper (Reference 4) analyzes in detail the I/O complexity of various operations under different tiered and leveled policies. Using the number of levels, the adjacent-level size ratio, the number of entries, the buffer size, and a series of related variables, it derives concrete I/O cost formulas and the space amplification of each operation. For point lookups, the influence of Bloom filters is taken into account; for range queries, short and long queries are analyzed separately. The paper also proposes lazy leveling, a combination of tiering and leveling in which the largest level works in leveling mode and the other levels work in tiering mode. Lazy leveling suits mixed workloads of updates, point lookups, and long-range lookups, whereas pure tiering suits update-intensive workloads and pure leveling suits lookup-intensive ones. To adapt to different workloads, the merge frequency is tunable under tiering, leveling, and lazy leveling alike; this mechanism, called Fluid LSM, is similar to leveled-N in RocksDB. Dostoevsky further proposes maximizing throughput by dynamically computing the optimal configuration at run time.
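
For reference, the rough shape of the per-operation costs (paraphrased from the paper's analysis, asymptotics only; consult the paper for the exact bounds):

```latex
% T: adjacent-level size ratio, L: number of levels, B: entries per
% block, s: entries returned by a range scan, M/N: Bloom-filter bits
% per entry.
\begin{array}{lcc}
\textbf{operation}               & \textbf{tiering}  & \textbf{leveling} \\
\text{update (amortized)}        & O(L/B)            & O(TL/B) \\
\text{zero-result point lookup}  & O(TL\,e^{-M/N})   & O(L\,e^{-M/N}) \\
\text{short range lookup}        & O(TL)             & O(L) \\
\text{long range lookup}         & O(Ts/B)           & O(s/B) \\
\text{space amplification}       & O(T)              & O(1/T)
\end{array}
```

Lazy leveling keeps costs close to the leveling column for point lookups and long-range lookups while bringing the update cost closer to the tiering column, because only the largest level is eagerly merged into a single run.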

Reflection

The cost of compaction on a single machine is unavoidable, but different policies lead to different read-write and space amplification, and optimizing the worst case through proper policy selection has been one of the academic research directions for LSM-Tree systems in recent years. Most systems support multiple compaction policies with many tunable parameters to adapt to different business scenarios. This is a process of continuous refinement that solves problems as they emerge; it works most of the time, but extreme situations in complex scenarios or under mixed workloads are inevitable, so compaction driven by parameter tuning or manual operations is neither elegant nor efficient. In distributed scenarios, the single-node policy is treated as a black box, many problems are solved by throwing external resources at them, and, more importantly, the results depend on the scheduling policy.

References

1) https://www.scylladb.com/2016/08/30/date-tiered-compaction-strategy/
2) Compaction management in distributed key-value datastores
3) PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
4) Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree-Based Key-Value Stores via Adaptive Removal of Superfluous Merging
5) Lethe: A Tunable Delete-Aware LSM Engine
6) https://github.com/facebook/rocksdb/wiki/Compaction
7) https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html
8) http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html
