Optimizing X-Engine’s In-memory Read Performance

16 min readJan 6, 2021

By Gushen and Beilou

Background

Despite the shared LSM-tree structure, X-Engine and traditional LSM-tree-based engines such as RocksDB differ in their design philosophies, as shown in the following figure.

Key design 1: Data on X-Engine disks are usually stored at two layers, L1 and L2. The L0 layer is enabled only when data in MemTable fails to be compacted promptly and is temporarily stored on disks to relieve the memory pressure. Under normal circumstances, the frozen MemTable can be merged with L1 of the disk.

Key design 2: During the compaction between L1 and L2, the hot and cold data merging algorithm of X-Engine tends to store frequently accessed data at L1, and store less accessed data at L2 for compression and storage. This is a process of physically separating hot and cold data. After this process, L1 stores hot data only and L2 stores cold data only, ensuring higher memory usage during L1 caching.

According to the design intention, when X-Engine runs normally, the MemTable caches the recently written data that has not been flushed to the disk, and L1 stores the most frequently accessed data on the disk, most of which is cached in the memory. After layered data separation, optimizing X-Engine’s read performance can be divided into two independent sub-problems:

Data reads in the MemTable memory and at L1 are CPU-bound. Common optimization methods include optimizing the execution efficiency of CPU instructions and speeding up main memory access.
Read performance of L2 depends on the random-read capability of disks. To improve L2 read performance, we employ more precise identification of cold and hot data to maximize input/output operations per second (IOPS).

In view of the smaller data amount in the MemTable, the overall performance of X-Engine depends on the memory query efficiency of the data stored at L1, on the premise that the cold and hot data identification algorithm is accurate and the memory is sufficient to cache all the hot data. This brings out the topic of this article: how to maximize the read performance of memory hits.

Challenges in the Read Path

When the dataset size is less than the free memory capacity, the read performance of X-Engine depends on CPU resources. When we analyze the performance of a CPU-bound program, we need to identify the CPU-resource-consuming items. Based on the definition of CPU usage, the CPU is in the busy state when it is executing an instruction or when it stalls waiting for data. Therefore, the read performance optimization of the storage engine is narrowed down to two points.

The first point is improving the CPU cache hit rate, which relies on improving the spatial and temporal locality of memory access by execution threads.
The other point is improving execution efficiency of CPU instructions, which requires streamlining implementation code to remove unnecessary execution paths, and reducing status synchronization between threads to ensure smooth execution of the instruction pipeline.

The process of organizing and reading data at L1 is as follows. First, X-Engine divides the data into extents, each with a size of 2 MB. In the extents, records are encoded into 16-KB blocks, and each extent contains an index block to help locate data blocks. Overall, data organization at the L1 and L2 layers in X-Engine is an index structure similar to a B+ tree. We suppose that all operations hit the memory and a record with a key of X is read in the extent. These operations are performed in four steps.

The ExtentMeta array is the root node of this B+ tree, and the extent to which the array belongs can be located by using binary search.
An extent can be seen as a subtree. First, obtain the index block of the extent by running a hash-based search (querying the cache).
Next, identify the key of the data block to which X belongs from the index block, and locate the data block to which X belongs by running a hash-based search.
Then, find the actual content of the record in the data block in the memory, and return the value that corresponds to the record.

Based on the preceding logic of implementing the read path and a Perf analysis on X-Engine’s read processes of in-memory hits, we tried to make three improvements. (1) Optimize the encoding and searching instructions of data pages. (2) Reduce the management overhead of the buffer pool. (3) Optimize the cache flushing problem of multi-threaded operation on multiple cores. With these improvements, we managed to record a 124% improvement in the overall read performance.

In the next section, we will describe the root causes of the three problems and our solutions in detail. Meanwhile, we will evaluate the benefits of the optimization methods in each step through tests. In view of the generality of the three problems in database storage engines, these optimization methods are also applicable to InnoDB, RocksDB, and other engines.

Block Coding and SIMD

Data Page Coding and Problems

Like in the RocksDB or InnoDB engine, index pages and data pages in X-Engine adopt the same formats. Each block stores key-value pairs in sequence, and performs segmented key prefix compression, where each segment is called a restart interval. Each key-value pair is followed by a restart array, which stores the offset of each restart interval in the block. During a search, the engine first performs a binary search in the restart array to locate the smallest restart interval that may contain the target key, as shown in figures 1, 2, 3, and 4. Then, the engine starts the comparison in sequence from this restart interval, as shown in figure 5.

The search process is essentially a binary search in a pointer array. The restart interval offset stored in the restart array can be understood as a pointer to the restart interval. Perf analysis shows that this process has the following problems.

Memory access hops between the preceding key-value pair and the restart array. The restart array stores the restart interval offset only. During a search, the engine needs to access the location of the block content based on the offset to obtain the key. However, these hoppy-style memory access attempts are irregular in that the distance between each restart array and the destination key is random. In addition, these access attempts are separated by multiple cache lines, making it difficult to fully utilize the CPU cache and resulting in memory access overhead.
It is difficult to utilize the CPU cache in the two comparisons. Except the last comparison, the intervals between the current index key and the index key in the previous comparison may exceed one cache line. This makes it difficult to utilize the CPU cache in the previous comparison and results in high memory access overhead.

Cache-friendly Block Coding

To solve the CPU cache unfriendliness in binary searches of pointer arrays, we tried to convert the frequently accessed data pages into a two-layer B+ tree structure that is CPU-cache-friendly, and cluster time-adjacent records of access in the physical space. The following figure shows the logical storage model.

The following figure shows the physical storage model.

Each node in a block is built in sequence based on the physical storage model, and data query follows a method similar to that of a B+ tree. The CPU-cache-friendly two-layer B+ tree structure improves CPU cache utilization from three perspectives.

It guarantees the consistency of the spatial locality of memory organization with the temporal locality of access sequence. It adopts B+ tree as the index structure, stores key-value pairs in the B+ tree, and implements sequential searches in the node. In a B+ tree, adjacent nodes in access sequence are stored in the same node, and key-value pairs are stored in the node to avoid memory address hopping. Sequential search further ensures the consistency between temporal locality and spatial locality, and between the access sequence and the prefetch sequence. This ensures that prefetch operations are performed at the appropriate time as much as possible. The search algorithm of the B+ tree structure has the same time complexity as that of the binary search algorithm. Therefore, this algorithm has similar computing overhead for sequential search to that for binary search, because the number of keys in a node is relatively small.
It increases the ratio of the cache size to the data size by compressing storage meta information. For example, it uses a 2-byte format to store the offset instead of the leaf node pointer.
It selects the appropriate number of B+ tree layers. We set the number of B+ tree layers to 2, and fanout means to take the square root of N. Each layer of nodes generates at least one cache miss. Too many layers may lead to more cache misses, whereas the one-layer B+ tree may have a large node, and the computing overhead and CPU caching efficiency are almost the same as those of the restart array. When the number of layers is 2, the CPU cache’s support of in-flight load requests and prefetches can reduce the number of cache misses at each layer of nodes to only one to two, making the two-layer structure the most efficient choice for CPU caching.

This encoding format can well support range queries, and enables point-query-like acceleration for the first seek operation of a range query, without any negative impact on subsequent SCAN iterations. For more information about the optimization of B+ tree layers, see the previous article under this WeChat public account: Let Database Engines Bring out the Best of CPU Cache.

SIMD Instructions Accelerate Long Key Comparison

After the cache access problem for data page record queries is solved, further Perf analysis shows that the compare function in the query path takes more CPU cycles, resulting in even more prominent computing overhead for long key queries. In the new data page encoding structure, multiple index keys that need to be compared are simultaneously loaded to the CPU cache in each read. In view of this, the introduction of a Single Instruction, Multiple Data (SIMD) instruction set can realize parallel retrieval of multiple index keys in the cache within one CPU cycle.

To use the SIMD instruction set, we need to further adjust X-Engine’s data page encoding, as shown in the following figure.

Perform lossy compression on the keys. Segment, the processing unit of SIMD registers, is 64-bit long at the maximum, such as an AVX-512 instruction set. Segmenting long keys for parallel comparison cancels out the improved performance of parallel computing due to extra logic. Here, we split a key into three parts, common prefix, significant Nbits, and suffix, to save redundant operations on prefixes for the purpose of segmented parallel comparison. However, significant Nbits is highly distinct in logic, and N depends on the segment size in the SIMD register. In most cases, only one SIMD comparison is needed to obtain the result. If the significant Nbits are equal, you only need to compare the suffixes of some index keys in sequence.

Select a fixed fanout value and apply padding. The processing bandwidth of SIMD instructions, which is the number of segments in the SIMD register, is an exponent of 2. If the fanout value remains the root square of N, the preparation process will include many additional operations. For the purposes of keeping the node length CPU-cache-friendly and adapting to the SIMD instruction processing bandwidth, as well as accommodating the restart key quantity range in the data block, we finalized the leaf node fanout value to 8 and the root node fanout value to 16. For the nodes with insufficient numbers of nodes, the largest key in the node applies as the padding of the fanout value. In this case, the SIMD instructions are hard-coded to reduce branch prediction overhead.

Recoding index pages and data pages and accelerating SIMD instructions in combination can improve the point query performance by up to 20% in a test with different row lengths. In the TP system, the data size read or written by a single request is very small. In our test scenario, only one record is operated at a time. Therefore, the data page encoding and SIMD instruction optimization together improve the end-to-end performance by markedly 20%.

Cost for Buffer Pools and PointSwizzling

Management Cost of Buffer Pools

In-memory databases and on-disk databases have different design principles. A major difference lies in that during data page access, an in-memory database tends to use the pointer for direct access, whereas an on-disk database generally obtains the memory address of the data page in a data structure by using the page ID and reads the disk first when the data page is not cached before it finally accesses the data page. This indirect access leads to additional search overhead. Previous studies have shown that the buffer pool management structure consumes more than 30% of CPU resources when the dataset can be fully stored in the memory.

The virtual memory address space of the operating system is sufficient relative to the dataset size of a database. One possible solution is to use pointers for direct access in an on-disk database engine. When the memory space is insufficient, the swap mechanism of the operating system can be employed to replace memory pages. In this way, the storage engine has the same performance as an in-memory database when the dataset size is smaller than the free memory capacity, because the overhead for page ID mapping and search has been eliminated.

However, this method has a defect. That is, when the dataset size is larger than the free physical memory of the system, the swap mechanism of the operating system cannot accurately perceive the data access characteristics in the database, and may hence swap a critical memory object to the disk, resulting in performance fluctuations.

X-Engine is a general-purpose storage engine, and the datasets it loads exceed the free memory capacity in most cases. Therefore, you must introduce a buffer pool to manage the access to data pages. However, in our design that highlights precise separation of hot and cold data, most of the queries will hit the memory if the user load features obvious locality. In this case, the 30% CPU resource consumption rate of the buffer pool is quite wasteful and needs to be eliminated.

The extent information, index pages, and data pages in X-Engine are managed by the buffer pool cache. Specifically, the cache manages data by using hash tables and doubly linked lists, obtains data by using hash-based searches, and maintains the sequence information of the least recently used (LRU) linked lists after hits. The overhead is incurred from the following three aspects.

The computing overhead of the hash-based search algorithm. This includes calculating the hash value and searching for a linked list sequentially in a slot in the hash table. Sequential search for a linked list suffers from the problem of CPU cache unfriendliness.
The computing overhead of maintaining the LRU information. After a hit, X-Engine needs to adjust the location of the entry in the doubly linked list, and may need to maintain the pointers of the old pool and the high priority pool.
The overhead for lock contention. In addition to searches, the cache also serves insert and evict operations. Therefore, partitions need to be locked during LRU information queries and maintenance. The lock contention is fierce when the concurrency is high and the access traffic is large.

PointSwizzling

PointSwizzling is a method to optimize the CPU resource overhead caused by the indirect reference of the buffer pool. This method uses direct pointers instead of the page ID + HashMap search method. To tackle the problem of high computing overhead for hash-based searches in the X-Engine cache, we introduce direct access pointers for data access and use cache only for data management. The specific solution is described as follows.

After the target index page, data page, or extent meta is stored into the memory for the first time, we add a direct access pointer to the PointSwizzling target object. In this way, the direct access pointer can be used in the next access without HashMap calculation.

Concurrent access to and reclamation of target memory objects. When multiple accessing threads exist, one thread may trigger cache elimination when another thread is using a memory object. If this is the case, the memory object needs to be eliminated and the space occupied by this memory object needs to be reclaimed. There are two solutions to this problem: epoch-based memory reclamation and reference-counting-based memory reclamation. We have tested both solutions and concluded that the former solution outperforms the latter. For specific epoch-based implementations, see the appendix. The reason of this test result is that though reference counting is an atomic variable, its performance fails to live up to expectations in a multi-core environment where the cache is prone to failures due to cache inconsistencies from frequent updates and reads of the counter.

By using PointSwzzling, X-Engine’s memory data access pattern features a high CPU usage efficiency, just like an in-memory database does. When some data is eliminated or replaced, X-Engine can continue using the buffer pool to ensure a high memory hit rate. With the introduction of PointSwizzling, X-Engine’s point query performance in in-memory hit scenarios across key-value interfaces has been improved by nearly 60%.

Access to Cache of a Multi-core CPU

Data page encoding and PointSwizzling are both the optimization methods for the single-thread operation process. The new encoding method of data pages improves the hit rate of CPU cache, whereas SIMD and PointSwizzling improve CPU efficiency for instruction execution, or in other words, the inter-process communication (IPC) of the program.

After we analyzed the overall execution framework of X-Engine, we found another cause of low CPU cache utilization: multi-threading concurrency.

Multi-threading Concurrency

X-Engine is a storage engine in ApsaraDB RDS for MySQL and PolarDB MySQL. Its execution framework is similar to that of MySQL, implementing a one-thread-per-client or thread pool mode. For such a framework, any thread can access the data in all tables. On today’s servers with 32, 64, or even 128 cores, the highly concurrent execution threads cause the following problems.

Inefficient CPU caching. For example, on a 128-core server, if we concurrently access 1,000 tables, the same memory data will be loaded in the L1 or L2 cache of multiple CPU cores, resulting in low cache utilization. Meanwhile, if an update occurs, the cache consistency across the cores will also result in high overhead. Context switches from thread scheduling between different cores can also cause frequent and extensive CPU cache invalidation.
High overhead of thread synchronization. All memory data is shared, and accessing and modifying the data requires the protection of locks, such as when you perform the aforementioned cache query, insertion, and elimination operations. This translates to high overhead for lock contention. For thread synchronization by performing atomic operations, synchronizing statuses of atomic variables at the L1 or L2 layer of multiple CPU cores requires data transmission and cross-core communication, which is also resource-consuming.

To solve these problems, industry pioneers such as H-Store and VoltDB have provided an optimization method for reference.

Multi-core Shared-nothing Architecture

We made the following modifications to X-Engine’s execution framework.

We set the number of active data-reading threads to the number of available CPU cores and bind specific threads to specific cores for execution to avoid frequent thread scheduling.

We also partitioned all the subtables in X-Engine. Specific table partitions can only be accessed by the threads that run on specific cores. This prevents the transmission of multiple replicas of the same data or the same data between the caches of different cores.

This per-core-level shared-nothing architecture has the following advantages.

High CPU cache efficiency. The same copy of data can only be cached in the L1 or L2 cache of one CPU core, cutting the overhead for cache consistency. No context switches occur for threads. Context switching can cause cache invalidation.

Greatly reduced overhead for thread synchronization. Data is not shared between foreground threads, and inter-thread synchronization does not require locks or atomic operations. This solves cache lock contention, swizzle or unswizzle operations of pointer swizzling, and memory reclamation among other problems.

This method can improve the read-only performance by about 30%. However, this solution has not been fully implemented. On the one hand, we need to further optimize the access to hot subtables. On the other hand, cross-table access is common in transaction engines. If a thread runs a transaction that needs to access multiple tables, the scheduling of the transaction is very challenging.

Experimental Verification

Test Environment

64 threads with 2 sockets
L1 cache: 32-KB data cache or 32-KB instruction cache
L2 cache: 256 KB
L3 cache: 32 MB
RAM: 512 GB. You can run the “numactl — interleave=all” command to disable NUMA.

Test Scenarios

The key is 16 bytes long, the value is 8 bytes long, and there are 32 tables, each of which contains 20 million entries. After data is written to the tables, flush the tables to the L1 layer. Run a pre-reading and pre-heating round to load the data to the cache, and then run the readrandom test.

Test Results

We used the random read performance of the original X-Engine architecture as the baseline and then applied the preceding optimization methods in sequence.

First, we added the new data page coding method and the SIMD instruction search algorithm on data pages. As a result, the performance was improved by 13.8%.

Second, we added the PointSwizzling optimization in the code. As a result, the performance was improved by 90% over the baseline. Finally, we reconstructed the X-Engine execution framework to introduce the multi-core shared-nothing architecture.

As a result, the overall performance was 123% higher than that of the baseline. We also analyzed the changes in the miss rates of the CPU L1 data cache and in the program InterProcess Communication (IPC) during the tests.

To identify the paths in which our optimization methods play the most critical role, we performed a breakdown test on the performance improvements by each optimization method and summarized the findings in the following table. The table shows that the most marked performance improvement was seen when indirect reference to the data block was changed to direct reference. Given the high number of data blocks, and the large hash lookup table in the original implementation, which resulted in low cache efficiency, this improvement is not surprising. However, the number of index and ExtentMeta objects is relatively small, which accounts for about 1/128 of the number of data blocks, so you can implement better search efficiency with a smaller impact on performance.