Optimizing X-Engine’s In-memory Read Performance

Background

  1. Data reads from the MemTable and from L1 are CPU-bound, because the data resides in memory. Common optimization methods include improving the execution efficiency of CPU instructions and speeding up main memory access.
  2. Read performance at L2 depends on the random-read capability of the disks. To improve it, we identify cold and hot data more precisely so as to make the most of the available input/output operations per second (IOPS).

Challenges in the Read Path

  1. The first is improving the CPU cache hit rate, which depends on improving the spatial and temporal locality of the memory accesses made by execution threads.
  2. The second is improving the execution efficiency of CPU instructions, which requires streamlining the implementation to remove unnecessary execution paths and reducing state synchronization between threads so that the instruction pipeline runs smoothly.
Searching for a record X proceeds through a B+ tree-like structure (a sketch in code follows this list):

  1. The ExtentMeta array serves as the root node of this B+ tree; a binary search over the array locates the extent to which X belongs.
  2. An extent can be seen as a subtree. First, obtain the index block of the extent by running a hash-based search (querying the cache).
  3. Next, from the index block, identify the key of the data block to which X belongs, and locate that data block by running another hash-based search.
  4. Finally, find the actual content of the record within the in-memory data block and return the corresponding value.
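
The following is a minimal sketch of this lookup path. The type and function names (ExtentMeta, IndexBlock, DataBlock, FindExtent, and so on) are illustrative assumptions rather than X-Engine's actual interfaces; the sketch only mirrors the four steps above.

    #include <cstddef>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Hypothetical, simplified structures; the real X-Engine types differ.
    struct DataBlock {
        std::vector<std::pair<std::string, std::string>> records;  // sorted key-value pairs

        // Step 4: find the record inside the in-memory data block.
        std::optional<std::string> Find(const std::string& key) const {
            for (const auto& kv : records)
                if (kv.first == key) return kv.second;
            return std::nullopt;
        }
    };

    struct IndexBlock {
        // Step 3: hash-based search mapping a key to its data block
        // (standing in for the cache query described above).
        std::unordered_map<std::string, const DataBlock*> block_of_key;

        const DataBlock* LookupDataBlock(const std::string& key) const {
            auto it = block_of_key.find(key);
            return it == block_of_key.end() ? nullptr : it->second;
        }
    };

    struct ExtentMeta {
        std::string smallest_key;       // used by the binary search over the array
        const IndexBlock* index_block;  // step 2: obtained via a hash-based cache query
    };

    // Step 1: binary search on the ExtentMeta array (the "root node") finds the
    // extent whose key range should contain X.
    const ExtentMeta* FindExtent(const std::vector<ExtentMeta>& metas,
                                 const std::string& key) {
        std::size_t lo = 0, hi = metas.size();
        while (lo < hi) {
            std::size_t mid = lo + (hi - lo) / 2;
            if (metas[mid].smallest_key <= key) lo = mid + 1; else hi = mid;
        }
        return lo == 0 ? nullptr : &metas[lo - 1];
    }

    std::optional<std::string> Get(const std::vector<ExtentMeta>& metas,
                                   const std::string& key) {
        const ExtentMeta* extent = FindExtent(metas, key);                   // step 1
        if (!extent) return std::nullopt;
        const DataBlock* block = extent->index_block->LookupDataBlock(key);  // steps 2-3
        if (!block) return std::nullopt;
        return block->Find(key);                                             // step 4
    }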

Block Coding and SIMD

Data Page Coding and Problems

  1. Memory access hops between the preceding key-value pair and the restart array. The restart array stores only the offsets of the restart intervals, so during a search the engine must follow each offset into the block content to read the corresponding key. These hops are irregular because the distance between a restart-array entry and the key it points to is essentially random, and each hop typically spans multiple cache lines. As a result, the CPU cache is poorly utilized and memory access overhead is high. A sketch of this conventional coding appears after this list.
  2. Binary search makes poor use of the CPU cache across consecutive comparisons. Except for the last comparison, the key probed in one comparison and the key probed in the next are usually more than one cache line apart, so the data loaded by the previous comparison cannot be reused, again resulting in high memory access overhead.
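
For reference, here is a minimal, hypothetical sketch of such a restart-array coding (in the spirit of conventional block formats; it is not X-Engine's exact layout). The binary search runs over the restart array, and every probe dereferences an offset back into the block body, which is exactly the irregular hop pattern described in point 1.

    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    // Simplified block layout: `body` holds length-prefixed keys, `restarts`
    // holds the byte offsets of the restart points.
    struct RestartBlock {
        std::string body;
        std::vector<uint32_t> restarts;

        // Reads the key stored at byte offset `off` in the block body.
        std::string KeyAt(uint32_t off) const {
            uint32_t len;
            std::memcpy(&len, body.data() + off, sizeof(len));
            return std::string(body.data() + off + sizeof(len), len);
        }

        // Binary search over the restart array. Each probe jumps from the
        // restart array to an effectively random position in `body`, usually
        // in a different cache line, so consecutive comparisons get little
        // help from the CPU cache or the hardware prefetcher.
        uint32_t SeekRestart(const std::string& target) const {
            uint32_t lo = 0, hi = static_cast<uint32_t>(restarts.size());
            while (lo < hi) {
                uint32_t mid = lo + (hi - lo) / 2;
                if (KeyAt(restarts[mid]) <= target) lo = mid + 1; else hi = mid;
            }
            return lo == 0 ? 0 : lo - 1;  // restart interval to scan linearly
        }
    };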

Cache-friendly Block Coding

  1. It keeps the spatial locality of the memory layout consistent with the temporal locality of the access sequence. It adopts a B+ tree as the index structure, stores the key-value pairs inside the tree nodes, and searches sequentially within a node. Keys that are adjacent in the access sequence are stored in the same node, which avoids hops between memory addresses. Sequential search further aligns temporal locality with spatial locality, and the access sequence with the prefetch sequence, so prefetches are issued at the right time as often as possible. The B+ tree search has the same time complexity as binary search, and because the number of keys in a node is small, the computing overhead of sequential search within a node is close to that of binary search.
  2. It increases the ratio of cache size to data size by compressing the stored metadata. For example, it stores a 2-byte offset instead of a leaf-node pointer.
  3. It selects an appropriate number of B+ tree layers. We set the number of layers to 2, with a fan-out of roughly the square root of N (the number of keys in the block). Each layer of nodes incurs at least one cache miss, so too many layers mean more cache misses, whereas a one-layer B+ tree has one large node whose computing overhead and CPU caching efficiency are almost the same as those of the restart array. With two layers, the CPU's support for in-flight load requests and prefetching keeps the number of cache misses at each layer to only one or two, making the two-layer structure the most efficient choice for CPU caching. A sketch of this layout appears after this list.
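
Below is a minimal sketch of such a two-layer, cache-friendly block index. The structure names and the exact layout are assumptions for illustration only: a root of separator keys, roughly sqrt(N) leaf nodes built from the sorted keys, 2-byte offsets in place of pointers, and sequential search within each node.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Leaf entry: the key plus a 2-byte offset into the block body (instead
    // of an 8-byte pointer), which shrinks the index so more of it fits in
    // the CPU cache.
    struct LeafEntry {
        std::string key;
        uint16_t offset;
    };

    struct TwoLayerBlockIndex {
        std::vector<std::string> root;               // last key of each leaf node
        std::vector<std::vector<LeafEntry>> leaves;  // leaves[i] belongs to root[i]

        // Builds the two-layer index from entries already sorted in key order;
        // the fan-out is roughly sqrt(N), so both layers stay small.
        static TwoLayerBlockIndex Build(const std::vector<LeafEntry>& sorted) {
            TwoLayerBlockIndex idx;
            std::size_t n = sorted.size();
            std::size_t fanout =
                static_cast<std::size_t>(std::ceil(std::sqrt(static_cast<double>(n))));
            for (std::size_t i = 0; i < n; i += fanout) {
                std::size_t end = std::min(i + fanout, n);
                idx.leaves.emplace_back(sorted.begin() + i, sorted.begin() + end);
                idx.root.push_back(sorted[end - 1].key);
            }
            return idx;
        }

        // Sequential search in the root, then in one leaf. Keys compared in
        // succession sit next to each other in memory, so the access order
        // matches the memory order and hardware prefetching stays effective.
        int Seek(const std::string& target) const {
            std::size_t i = 0;
            while (i < root.size() && root[i] < target) ++i;
            if (i == root.size()) return -1;                  // larger than all keys
            for (const LeafEntry& e : leaves[i])
                if (e.key >= target) return e.offset;
            return -1;
        }
    };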

SIMD Instructions Accelerate Long Key Comparison
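
The article does not spell out the implementation here, so the following is only a generic sketch of how SIMD instructions can speed up long key comparison, written with SSE2 intrinsics: 16 bytes of the two keys are compared per instruction, and a byte mask locates the first mismatching byte. It is an assumption-based illustration, not X-Engine's actual code.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <emmintrin.h>  // SSE2 intrinsics

    // Compares up to `len` bytes of two keys, 16 bytes at a time.
    // Returns a negative, zero, or positive value, like memcmp.
    int SimdCompare(const uint8_t* a, const uint8_t* b, std::size_t len) {
        std::size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            // One bit per byte: 0xFFFF means all 16 bytes match.
            int eq_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
            if (eq_mask != 0xFFFF) {
                unsigned diff = static_cast<unsigned>(~eq_mask) & 0xFFFFu;
                std::size_t first = i + __builtin_ctz(diff);  // GCC/Clang builtin
                return a[first] < b[first] ? -1 : 1;          // first mismatching byte
            }
        }
        return std::memcmp(a + i, b + i, len - i);            // compare the tail bytes
    }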

Cost for Buffer Pools and PointSwizzling

Management Cost of Buffer Pools

  1. The computing overhead of the hash-based search, which includes calculating the hash value and sequentially traversing the linked list in the corresponding hash table slot. Traversing a linked list is CPU cache-unfriendly.
  2. The computing overhead of maintaining LRU information. After a hit, X-Engine needs to move the entry within the doubly linked list, and it may also need to maintain the pointers of the old pool and the high-priority pool.
  3. The overhead of lock contention. Besides searches, the cache also serves insert and evict operations, so each partition must be locked while LRU information is queried and maintained. Lock contention becomes fierce under high concurrency and heavy access traffic. A sketch of one such cache partition follows this list.
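
To make the three costs concrete, here is a minimal, hypothetical sketch of one partition of a hash-indexed LRU cache (not X-Engine's actual cache code). The hash walk, the LRU list adjustment, and the per-partition mutex correspond to costs 1, 2, and 3 above.

    #include <cstddef>
    #include <list>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // One partition (shard) of a hypothetical block cache.
    class CachePartition {
    public:
        // Cost 1: hash the key and walk the bucket.
        // Cost 2: move the hit entry to the head of the LRU list.
        // Cost 3: all of this happens under the partition lock.
        const std::string* Lookup(const std::string& key) {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = index_.find(key);                    // hash-based search
            if (it == index_.end()) return nullptr;
            lru_.splice(lru_.begin(), lru_, it->second);   // LRU maintenance
            return &it->second->second;
        }

        // Duplicate keys are not handled in this sketch.
        void Insert(const std::string& key, std::string value) {
            std::lock_guard<std::mutex> guard(mutex_);
            lru_.emplace_front(key, std::move(value));
            index_[key] = lru_.begin();
            if (lru_.size() > capacity_) {                 // evict the coldest entry
                index_.erase(lru_.back().first);
                lru_.pop_back();
            }
        }

    private:
        using Entry = std::pair<std::string, std::string>;
        std::mutex mutex_;
        std::list<Entry> lru_;                             // hot entries at the front
        std::unordered_map<std::string, std::list<Entry>::iterator> index_;
        std::size_t capacity_ = 1024;
    };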

PointSwizzling
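
PointSwizzling refers to pointer swizzling: once an object has been located through the cache, the logical reference (an identifier resolved via a hash lookup) is replaced with a direct in-memory pointer, so hot-path reads skip the hash search, the LRU maintenance, and the lock. The sketch below illustrates that general idea under assumed names and interfaces; it is not X-Engine's implementation, and a block must be unswizzled again if the cache evicts it.

    #include <atomic>
    #include <cstdint>
    #include <string>

    struct Block { std::string payload; };  // the cached object (simplified)

    // Stand-in for the slow path: hash-based search plus LRU maintenance
    // under the partition lock. Hypothetical stub for illustration.
    Block* CacheLookup(uint64_t /*block_id*/) { return nullptr; }

    // A reference that starts "unswizzled" (a logical block id) and is
    // swizzled into a direct pointer after the first successful lookup.
    class SwizzlingRef {
    public:
        explicit SwizzlingRef(uint64_t block_id) : block_id_(block_id) {}

        Block* Get() {
            Block* direct = ptr_.load(std::memory_order_acquire);
            if (direct != nullptr) return direct;      // fast path: no hash, no lock
            Block* found = CacheLookup(block_id_);     // slow path through the cache
            if (found != nullptr)
                ptr_.store(found, std::memory_order_release);  // swizzle the pointer
            return found;
        }

        // Called when the cache evicts the block, so the next access goes
        // back through the cache instead of using a dangling pointer.
        void Unswizzle() { ptr_.store(nullptr, std::memory_order_release); }

    private:
        uint64_t block_id_;
        std::atomic<Block*> ptr_{nullptr};
    };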

Cache Access on a Multi-core CPU

Multi-threading Concurrency

  1. Inefficient CPU caching. For example, on a 128-core server, if 1,000 tables are accessed concurrently, the same memory data is loaded into the L1 or L2 caches of multiple CPU cores, which lowers cache utilization. When an update occurs, keeping the caches of all cores coherent also incurs high overhead. In addition, context switches caused by scheduling threads across different cores lead to frequent and extensive CPU cache invalidation.
  2. High overhead of thread synchronization. All memory data is shared, so accessing and modifying it requires the protection of locks, for example during the cache query, insertion, and eviction operations described above, and this translates into heavy lock contention. Even when threads synchronize through atomic operations, keeping the state of atomic variables consistent across the L1 or L2 caches of multiple CPU cores requires cross-core data transfer and communication, which is also expensive.

Multi-core Shared-nothing Architecture
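
A minimal sketch of the shared-nothing idea follows, under the assumption that the cache is partitioned per core (or per worker thread) and each partition is only ever touched by its owning thread: no locks are needed, no cache lines are shared across cores, and each core's CPU cache holds only the data that core actually serves. The routing scheme and names are illustrative, not X-Engine's implementation.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Each worker owns exactly one partition; only that worker ever touches
    // it, so the partition needs no mutex and its working set stays resident
    // in the owning core's cache.
    struct Partition {
        std::unordered_map<std::string, std::string> cache;  // thread-private data
    };

    class SharedNothingCache {
    public:
        explicit SharedNothingCache(std::size_t workers) : partitions_(workers) {}

        // Routes a key to the worker that owns it; the request is then
        // executed on that worker's thread (simplified here to a direct call
        // with the worker id supplied by the caller).
        std::size_t OwnerOf(const std::string& key) const {
            return std::hash<std::string>{}(key) % partitions_.size();
        }

        // Both methods must be called only from the owning worker's thread.
        void Put(std::size_t worker, const std::string& key, std::string value) {
            partitions_[worker].cache[key] = std::move(value);   // no lock needed
        }

        const std::string* Get(std::size_t worker, const std::string& key) const {
            const auto& c = partitions_[worker].cache;
            auto it = c.find(key);
            return it == c.end() ? nullptr : &it->second;
        }

    private:
        std::vector<Partition> partitions_;
    };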

Experimental Verification

Test Environment

Test Scenarios

Test Results
