When Databases Meet FPGA — Achieving 1 Million TPS with X-DB Heterogeneous Computing

Overview of Alibaba’s X-Engine

  1. High transaction throughput and low latency in write and read operations.
  2. Write operations make up a relatively high proportion of the workload. In traditional databases, the read-to-write ratio is usually above 10:1, whereas Alibaba’s transaction system reached 3:1 on the day of the 2017 Double 11 shopping carnival.
  3. Data access hotspots are relatively concentrated. A newly written record is accessed mainly (99% of the time) within its first seven days, and the probability that it is accessed later is extremely low.

  1. Current acceleration solutions are designed for the SQL layer; the FPGA is generally placed between storage and host and is used as a filter. Although researchers have made numerous attempts to use FPGAs to accelerate OLAP systems, the FPGA acceleration design for the OLAP system remains a challenge.
  2. As FPGA chips get smaller and smaller, internal errors such as single event upsets (SEU) pose a growing threat to FPGA reliability. For a single chip, an internal error is expected roughly once every 3–5 years. A fault tolerance mechanism therefore becomes vitally important for systems that must stay available at large scale.
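
To see why a per-chip figure of 3–5 years matters at scale, here is a rough back-of-the-envelope sketch. It reads the 3–5 year figure as the mean time between internal errors on one chip, which is an interpretation, and the fleet size of 1,000 chips is purely hypothetical.

```python
def expected_errors_per_year(num_chips, years_between_errors):
    """Expected internal errors per year across a fleet.

    Assumes errors on different chips are independent and that each chip
    fails on average once every `years_between_errors` years.
    """
    return num_chips / years_between_errors


# Hypothetical fleet of 1,000 FPGA cards (not a figure from the article).
worst_case = expected_errors_per_year(1000, 3)
best_case = expected_errors_per_year(1000, 5)
print(f"{best_case:.0f} to {worst_case:.0f} errors per year")
```

Under these assumptions, a fleet of that size would see an internal error on most days of the year, which is why a rare single-chip event still needs a systematic recovery path.
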

  1. Highly efficient design and implementation of FPGA compaction: Using a pipelined compaction design, FPGA compaction achieves 10 times the processing performance of a single CPU thread.
  2. Asynchronous scheduling logic design for the hybrid storage engine: Because the FPGA can complete a compaction request within milliseconds, a traditional synchronous scheduling method would block a large number of compaction threads and incur heavy thread-switching costs. Asynchronous scheduling reduces the thread-switching cost and improves the system’s engineering availability.
  3. Fault tolerance mechanism design: Input data limits and FPGA internal errors may cause some compaction tasks to be rolled back. To ensure data integrity, every task rolled back by the FPGA is re-executed by an equivalent CPU compaction thread. The fault tolerance mechanism described in this article meets Alibaba’s actual business requirements and guards against the FPGA’s internal instability.

X-Engine Compaction

FPGA Accelerated Database

System Design

  1. A user submits a request to operate on a specified KV pair (Get/Insert/Update/Delete). For a write operation, a new record is appended to a memtable.
  2. When a memtable reaches its maximum size, it becomes an immutable memtable.
  3. The immutable memtable is then converted into an SSTable and flushed to persistent storage.
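
The write path above can be sketched as a toy in-memory model. This is a minimal illustration, not X-Engine's implementation: the class name `TinyLSM`, the record-count limit, and the use of Python lists as "persistent storage" are all assumptions made for the example.

```python
class TinyLSM:
    """Toy write path: memtable -> immutable memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical limit, in records
        self.memtable = {}    # active memtable (a dict keeps only the latest version)
        self.immutable = []   # frozen memtables waiting to be flushed
        self.sstables = []    # stand-in for persistent storage, oldest first

    def put(self, key, value):
        # Step 1: a write lands in the active memtable.
        self.memtable[key] = value
        # Step 2: a full memtable is frozen and replaced by a fresh one.
        if len(self.memtable) >= self.memtable_limit:
            self.immutable.append(self.memtable)
            self.memtable = {}

    def flush(self):
        # Step 3: each immutable memtable becomes a sorted SSTable.
        while self.immutable:
            frozen = self.immutable.pop(0)
            self.sstables.append(sorted(frozen.items()))

    def get(self, key):
        # Reads check the newest data first, then fall back to older layers.
        if key in self.memtable:
            return self.memtable[key]
        for frozen in reversed(self.immutable):
            if key in frozen:
                return frozen[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is frozen
db.put("a", 3)   # newer version of "a" in the fresh memtable
db.flush()
```

Note how `get` has to walk more layers as SSTables pile up. Compaction exists to merge those layers and keep this read path short.
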

  1. The CPU loads the SSTables that need to be compacted from persistent storage, splits them into multiple compaction tasks at data-block granularity according to the metadata, and pre-allocates memory for the computation result of each task. It then pushes each successfully created compaction task into the Task Queue for the FPGA to execute.
  2. The CPU reads the status of the Compaction Units on the FPGA and allocates compaction tasks from the Task Queue to available Compaction Units.
  3. The input data is transmitted to the FPGA’s DDR via DMA.
  4. A Compaction Unit executes the compaction task and transmits the computation result back to the host via DMA, attaching a return code that indicates whether the task succeeded or failed. The results of finished tasks are then pushed to the Finished Queue.
  5. The CPU checks the status of each compaction result in the Finished Queue. If a compaction task failed, the CPU executes it again.
  6. The CPU flushes the compaction results to storage.
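
The six steps above can be sketched as a small simulation. It is a sketch under stated assumptions: `fpga_compact` is a software stand-in for a Compaction Unit, the return codes and the `split_keys` parameter are hypothetical, tasks are split by key range rather than by real data-block metadata, and DMA transfers and memory pre-allocation are not modeled.

```python
import bisect
from collections import deque

SUCCESS, FAIL = 0, 1  # hypothetical return codes


def build_tasks(sstables, split_keys):
    """Step 1: split sorted SSTables into tasks covering disjoint key ranges."""
    tasks = [[] for _ in range(len(split_keys) + 1)]
    for table in sstables:
        keys = [k for k, _ in table]
        start = 0
        for i, split in enumerate(split_keys):
            end = bisect.bisect_left(keys, split)
            tasks[i].append(table[start:end])
            start = end
        tasks[-1].append(table[start:])
    return tasks


def cpu_compact(blocks):
    """CPU compaction: merge blocks; earlier blocks are newer and win on ties."""
    merged = {}
    for block in reversed(blocks):
        merged.update(block)
    return sorted(merged.items())


def fpga_compact(blocks, fail=False):
    """Stand-in for a Compaction Unit; returns (return_code, result)."""
    return (FAIL, None) if fail else (SUCCESS, cpu_compact(blocks))


def run_compaction(sstables, split_keys, failing_tasks=()):
    task_queue = deque(enumerate(build_tasks(sstables, split_keys)))
    finished_queue = deque()
    while task_queue:  # steps 2-4: hand each task to a (simulated) CU
        task_id, blocks = task_queue.popleft()
        code, result = fpga_compact(blocks, fail=task_id in failing_tasks)
        finished_queue.append((task_id, blocks, code, result))
    output, retried = [], []
    for task_id, blocks, code, result in finished_queue:
        if code != SUCCESS:  # step 5: failed tasks are re-executed on the CPU
            result = cpu_compact(blocks)
            retried.append(task_id)
        output.extend(result)  # step 6: results would be flushed to storage
    return output, retried


newer = [("a", 1), ("c", 3)]
older = [("a", 0), ("b", 2)]
output, retried = run_compaction([newer, older], ["b"], failing_tasks={1})
```

Because each task covers a disjoint key range, a failed task can be redone on the CPU without touching the results of the tasks that succeeded.
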

Detailed Design

FPGA-based Compaction

  1. Decoder: In X-Engine, KV pairs are stored in data blocks after compression and encoding. The primary function of the Decoder module is to decode KV pairs. Each CU contains 4 Decoders, so a CU supports compaction tasks of at most 4 KV pairs. Tasks that involve more than 4 must be split by the CPU. Based on our assessment, most compaction tasks involve fewer than 4. We chose 4 Decoders after weighing performance against hardware resources: compared with a 2-Decoder configuration, this uses 100% more hardware resources but delivers a 300% performance improvement.
  2. KV Ring Buffer: KV pairs decoded by the Decoder module are temporarily stored in a KV Ring Buffer. Each KV Ring Buffer has a read indicator (maintained by the Controller module) and a write indicator (maintained by the Decoder module), and exposes three signals that indicate its current status: FLAG_EMPTY, FLAG_HALF_FULL, and FLAG_FULL. While FLAG_HALF_FULL is low, the Decoder module continues decoding KV pairs. Otherwise, the Decoder module pauses until downstream consumers in the pipeline have consumed the decoded KV pairs.
  3. KV Transfer: This module is responsible for transmitting keys to the Key Buffer. Because merging KV pairs involves only key comparison, the values do not need to be transmitted. The KV pairs currently being compared can be tracked through the read indicator.
  4. Key Buffer: This module stores the keys of the KV pairs that need to be compared. When all keys that need to be compared have been transmitted to the Key Buffer, the Controller notifies the Compaction PE to compare them.
  5. Compaction PE: The Compaction Processing Engine (Compaction PE) is responsible for comparing the keys in the Key Buffer. Comparison results are sent to the Controller, which then notifies KV Transfer to transmit the corresponding KV pair to the Encoding KV Ring Buffer for the Encoder module to encode.
  6. Encoder: The Encoder module is responsible for encoding KV pairs from the Encoding KV Ring Buffer into a data block. When the data block reaches its maximum size, it is flushed to DDR.
  7. Controller: The Controller acts as the coordinator in the CU. Although the Controller is not itself a stage of the compaction pipeline, it plays a key role in each step of the pipeline.
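
The module pipeline above can be approximated in software. The sketch below is not the hardware design: it reads the 4-Decoder limit as "at most four sorted inputs per task", which is an interpretation, and it assumes that lower-numbered inputs are newer and that each input has unique, sorted keys. The names `KVRingBuffer`, `compact`, and `MAX_WAYS` are made up for the example.

```python
from collections import deque

MAX_WAYS = 4  # assumed meaning of "4 Decoders": one Decoder per sorted input


class KVRingBuffer:
    """Bounded FIFO exposing the three status flags named in the text."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.items = deque()

    @property
    def flag_empty(self):
        return len(self.items) == 0

    @property
    def flag_half_full(self):
        return 2 * len(self.items) >= self.capacity

    @property
    def flag_full(self):
        return len(self.items) >= self.capacity

    def push(self, kv):
        assert not self.flag_full
        self.items.append(kv)

    def peek_key(self):
        return self.items[0][0]

    def pop(self):
        return self.items.popleft()


def compact(runs, capacity=4):
    """Merge up to MAX_WAYS sorted runs of (key, value) pairs."""
    if len(runs) > MAX_WAYS:
        raise ValueError("too many inputs: split the task on the CPU first")
    sources = [iter(run) for run in runs]
    buffers = [KVRingBuffer(capacity) for _ in runs]
    output = []
    while True:
        # Decoder: keep decoding while FLAG_HALF_FULL is low.
        for source, buf in zip(sources, buffers):
            while not buf.flag_half_full:
                kv = next(source, None)
                if kv is None:
                    break
                buf.push(kv)
        # KV Transfer + Key Buffer: only the head keys are gathered.
        heads = [(buf.peek_key(), i)
                 for i, buf in enumerate(buffers) if not buf.flag_empty]
        if not heads:
            return output
        # Compaction PE: smallest key wins; ties go to the newest input.
        key, winner = min(heads)
        output.append(buffers[winner].pop())  # the Encoder would pack this KV
        for other_key, i in heads:
            if i != winner and other_key == key:
                buffers[i].pop()  # drop the overwritten older version


merged = compact([[("a", 1), ("c", 3)], [("a", 0), ("b", 2)]])
```

Only keys are compared in the merge step, which mirrors the point that values never need to travel to the Key Buffer.
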

  1. At all KV lengths, FPGA compaction has a higher throughput than single-thread CPU compaction, which demonstrates the feasibility of offloading compaction.
  2. As the KV length increases, FPGA compaction throughput decreases. This is because more bytes need to be compared, which raises the cost of each comparison.
  3. The acceleration ratio (FPGA throughput / CPU throughput) increases with the value length. When the KV length is short, the modules must communicate and check status frequently, and this overhead is relatively high compared with normal pipeline operations.

Asynchronous Scheduling Logic Design

  1. The CPU is responsible for building compaction tasks and pushing them into the Task Queue.
  2. A thread pool is maintained to distribute compaction tasks to specified CUs.
  3. When a compaction task is finished, it will be pushed to the Finished Queue.
  4. The CPU will then check the task execution status, and schedule CPU compaction threads to re-execute the failed compaction tasks.
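
The scheduling flow above could be sketched with a thread pool and a queue, as below. This is only an illustration of the asynchronous pattern: `fpga_execute` and `cpu_execute` are hypothetical stand-ins, and the rule that every third task fails is invented so the retry path is exercised.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def fpga_execute(task_id):
    """Hypothetical CU call: pretend every third task fails."""
    return task_id % 3 != 0


def cpu_execute(task_id):
    """Hypothetical CPU compaction used as the fallback path."""
    return True


def dispatch(task_id, finished):
    # Runs on a pool thread: drive one CU, then report to the Finished Queue.
    finished.put((task_id, fpga_execute(task_id)))


def schedule(task_ids, workers=4):
    finished = queue.Queue()  # the Finished Queue
    done, retried = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submitting returns immediately, so the task builder is never
        # blocked waiting for an individual CU.
        for task_id in task_ids:
            pool.submit(dispatch, task_id, finished)
        # Results are handled as they arrive, in completion order.
        for _ in task_ids:
            task_id, ok = finished.get()
            if not ok:
                cpu_execute(task_id)  # re-execute failed tasks on the CPU
                retried.append(task_id)
            done.append(task_id)
    return sorted(done), sorted(retried)


done, retried = schedule([1, 2, 3, 4, 5, 6])
```

The design choice worth noting is that the thread building tasks and the threads waiting on CUs are decoupled by queues, so a slow or failed CU delays only its own task.
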

Fault Tolerance Mechanism Design

  1. Data gets damaged during the transmission process: Calculate the CRC values of data before and after transmission, and compare the values. If these two CRC values are inconsistent, it means that the data is damaged.
  2. FPGA internal errors (bit upset): To detect this problem, we pair each CU with an additional, redundant CU. We compare the computation results of both CUs, and any inconsistency indicates that a bit upset error has occurred.
  3. Input data of a compaction task is invalid: To facilitate FPGA compaction design, we have set a restriction on the length of KVs. The compaction tasks for KVs that exceed the maximum allowable length are identified as invalid tasks.
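
The three checks above can be combined into one guarded execution path, sketched below. The error codes, the `MAX_KV_LEN` value (the article does not state the real limit), the choice to apply the limit to key plus value, and the `serialize` format are all assumptions for illustration. `zlib.crc32` is the standard-library CRC-32 and stands in for whatever CRC the hardware uses.

```python
import zlib

MAX_KV_LEN = 64  # hypothetical limit; the real value is not given in the text
OK, ERR_INVALID_INPUT, ERR_CRC, ERR_MISMATCH = range(4)  # hypothetical codes


def serialize(kvs):
    """Length-prefixed encoding, used here only to have bytes to checksum."""
    return b"".join(
        len(k).to_bytes(2, "big") + k + len(v).to_bytes(2, "big") + v
        for k, v in kvs
    )


def run_checked(kvs, transmit, cu_primary, cu_shadow):
    # Check 3: reject tasks whose KVs exceed the supported length.
    if any(len(k) + len(v) > MAX_KV_LEN for k, v in kvs):
        return ERR_INVALID_INPUT, None
    # Check 1: compare CRC values before and after transmission.
    sent = serialize(kvs)
    received = transmit(sent)
    if zlib.crc32(sent) != zlib.crc32(received):
        return ERR_CRC, None
    # Check 2: run the task on two CUs and compare their results.
    primary, shadow = cu_primary(kvs), cu_shadow(kvs)
    if primary != shadow:
        return ERR_MISMATCH, None
    return OK, primary


def clean_link(data):
    return data


def flaky_link(data):
    return data[:-1] + bytes([data[-1] ^ 1])  # flip one bit in transit


def broken_cu(kvs):
    return sorted(kvs)[:-1]  # simulate a bit upset that loses a record


kvs = [(b"b", b"2"), (b"a", b"1")]
status, result = run_checked(kvs, clean_link, sorted, sorted)
```

Any non-`OK` status would send the task back to a CPU compaction thread, as described in the system design.
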

Experiment Results

Lab environment

  1. CPU: 64-core Intel (E5-2682 v4, 2.50 GHz) processor
  2. Memory: 128 GB
  3. FPGA card: Xilinx VU9P
  4. memtable: 40 GB
  5. block cache: 40 GB

  1. X-Engine-CPU: compaction executed by the CPU
  2. X-Engine-FPGA: compaction offloaded to the FPGA for execution

DbBench

  1. In a write-only scenario, X-Engine-FPGA sees a 40% throughput increase. The performance curve shows that when compaction begins, the performance of X-Engine-CPU drops by one third.
  2. FPGA compaction has a higher throughput and finishes sooner, so the read path is shortened faster. Therefore, in the read/write hybrid scenario, X-Engine-FPGA throughput increases by 50%.
  3. Throughput in the read/write hybrid scenario is lower than in the write-only scenario. Read operations require access to data stored in the persistent layers, which introduces I/O cost and lowers overall throughput.
  4. The two performance curves represent two different compaction states. In the left figure, system performance jitters periodically, meaning that compaction is competing with normal transaction-handling threads for CPU resources. In the right figure, X-Engine-CPU’s performance stays at a low level, meaning that compaction is slower than writes, so SSTables accumulate and compaction tasks are constantly scheduled in the background.
  5. Compaction tasks are still scheduled by the CPU, which is why X-Engine-FPGA’s performance also jitters and its curve is not smooth.

YCSB

  1. On the YCSB benchmark, compaction causes X-Engine-CPU’s performance to drop by approximately 80%, whereas X-Engine-FPGA’s performance fluctuates by only 20%, owing to the compaction scheduling logic.
  2. The check-unique logic introduces read operations. As the stress test runs longer, the read path becomes longer, and the performance of both storage engines decreases over time.
  3. In the write-only scenario, X-Engine-FPGA’s throughput increases by 40%. However, as the read/write ratio increases, the acceleration effect of FPGA compaction gradually decreases. With a higher read/write ratio, the write pressure is lower and SSTables accumulate more slowly, which reduces the number of threads handling compaction tasks. X-Engine-FPGA therefore shows a more pronounced performance gain in write-intensive workloads.
  4. With the increase in the read/write ratio, the throughput increases. When write throughput is smaller than that of the KV interface, the cache miss ratio is relatively low, thus avoiding frequent I/O operations. With the increase in the proportion of write operations, the number of threads that handle compaction tasks also increases, thus reducing the system’s throughput capability.

TPC-C

  1. With FPGA acceleration, X-Engine-FPGA’s performance improves by 10%–15% across connection counts from 128 to 1024. As the number of connections increases, the throughput of both systems gradually decreases because lock contention on hotspot rows increases.
  2. TPC-C’s read/write ratio is 1.8:1. In the experiment, under the TPC-C benchmark, more than 80% of CPU resources were consumed by SQL parsing and lock contention on hotspot rows, so the actual write pressure was not very heavy. Based on our observation, X-Engine-CPU never used more than three threads (out of 64 cores) to execute compaction tasks. FPGA’s acceleration effect is therefore less pronounced than in the previous benchmarks.

SysBench

  1. X-Engine-FPGA improves throughput by more than 40%. Because SQL parsing consumes a large share of CPU resources, the throughput of the DBMS is lower than that of the KV interface.
  2. X-Engine-CPU settles at a low level. Because compaction is slower than writes, SST files accumulate and compaction is constantly scheduled.
  3. X-Engine-CPU’s performance is twice that of InnoDB, which shows the advantage of LSM-tree-based storage engines in write-intensive scenarios.
  4. Compared with the TPC-C benchmark, SysBench is closer to Alibaba’s real transaction scenario. In a transaction system, most queries are inserts and simple point queries, and range queries are rare. Fewer hotspot row conflicts mean that fewer resources are consumed in the SQL layer. During the experiment, we observed that when X-Engine-CPU uses more than 15 threads to execute compaction tasks, the performance improvement brought by FPGA acceleration is very pronounced.

Conclusion

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com
