OceanBase Did Better than Any Other Database in the TPC-C Benchmark

11.11 Big Sale for Cloud. Get unbeatable offers with up to 90% off on cloud servers and up to a $300 rebate for all products! Click here to learn more.

This article was put together by the core research and development team of OceanBase:
Yang Zhenkun, founder of OceanBase
Cao Hui, technical expert of OceanBase
Chen Mengmeng, senior technical expert of OceanBase
Pan Yi, senior technical expert of OceanBase
Han Fusheng, senior technical expert of OceanBase
Zhao Yuzong, advanced technical expert of OceanBase

Developed by the Transaction Processing Performance Council (TPC), TPC-C is an online transactional processing (OLTP) benchmark that is made specifically for testing transactions such as order creation and order payment in commodity sales. It is an authoritative, industry-standard benchmark for the online transactional processing system of a database.

OceanBase, Ant Financial’s distributed relational database, got the best results out of any database in the TPC-C benchmark, which, naturally, attracted a lot of attention. In this article, the core research and development team of OceanBase present their understanding of this benchmark and tell how OceanBase went a step beyond all the other databases tested.

How Did OceanBase Do So Well on the TPC-C Benchmark?

Test Preparations

How the Test Worked

As defined by the TPC-C benchmark, a test system can be divided into the remote terminal emulator (RTE) and system under test (SUT). In fact, the SUT can be further divided into the web application server (WAS) and the database server from the role perspective.

A database server is the database service provided by each vendor under testing. The RTE plays the role of a user terminal in the test model, which completes tasks such as transaction triggering and response time collection. The TPC-C benchmark requires that each user terminal maintain a persistent connection to the database server. However, database servers cannot support such number of connections in massive warehouses, for example, so, in such cases, the WAS bridges the RTE and the database server. According to the TPC-C benchmark guidelines, a connection pool can be used to manage all requests when the connection pool is transparent to applications.

Among the three roles, the WAS and the database server must be commercially available and provide payment services. OceanBase used OpenResty as its WAS provider. Generally, the RTE is implemented by each vendor under the test according to the benchmark. However, all implementation code must undergo strict auditing by the auditors. OceanBase implemented a complete set of RTEs that were fully compliant and could be deployed in a large test system. During the actual TPC-C benchmark test, the capabilities of the database server are tested, and RTEs need to be able to consume a lot of resources and withstanding a high amount of stress. In this test, we received the tpmC value of 60,880,000 and ran 960 RTEs on 64 64-core 128-GB cloud servers to simulate total 47,942,400 user terminals. Last, we audited and verified the consistency and durability of the RTE statistics.

Although RTEs were only used as clients during a benchmark test, many details might lead to a test failure. For example, the response time of all transactions increased by 100 ms due to the introduction of web terminal simulators, which may not be noticeable at first, and the output display was delayed for 100 ms due to the use of user terminal simulators.

Test Planning

  1. Hardware selection: In addition to the database server model, the RTE and WAS models also need to be determined. A large number of tests and optimizations are required to find the appropriate resource configuration of the server for each role while also ensuring cost-effectiveness. During OceanBase’s benchmark test, only cloud-based resources were used. This offered the biggest benefit for this test because we did not need to concern ourselves with the specifics of the underlying infrastructure, including the data centers, cabinets, and cabling. Instead, we only needed to rapidly adjust the specifications to find the target models.
  2. Parameter selection: It is not easy to determine appropriate configuration parameters. For instance, two typical parameters are the total number of warehouses we need to run, and the number of warehouses that are borne by each database server. To simulate a business scenario that is as practical as possible, the TPC-C benchmark sets different think times and keying times for each transaction to ensure that the data in a warehouse can provide a maximum tpmC value of 12.86. Therefore, a database vendor must load more warehouses to ensure higher performance. However, past experience shows that 80% of the storage space on a single server must be reserved for 60 days’ worth of incremental storage. In the distributed database schema, to make full use of each database server and the local storage and maximize the utilization of the CPU, memory, I/O capability, and storage space of each server, we had made adjustments and optimizations for nearly one month.

Performance Stress Tests

  1. Ramp-up. According to the TPC-C benchmark, each database is allowed to warm up for a certain period of time to be a steady state when the test is performed. However, all the configurations at the ramp-up stage must be consistent with the configurations in the final report.
  2. Steady. When the atomicity, consistency, isolation, and durability (ACID) as well as the serializable isolation are guaranteed, the database can run at the steady state for more than eight hours, with the tpmC value specified in the final report without manual intervention. In addition, the database completes a checkpoint every half an hour.
  3. Measurement interval. According to the TPC-C benchmark, the database needs to run at the steady state for eight hours, but only needs to run for more than two hours at the performance collection state. At this stage, the total tpmC fluctuation must not exceed 2%, and the database must complete at least four checkpoints.

Previously, the general process for performing the performance stress test on a database was as follows: Allow the database to warm up for a certain time so that it can reach the steady state. After the database enters the steady state for a certain period of time and completes a checkpoint, the performance collection stage begins, which typically lasts for two hours.

However, the most rigorous process in the TPC-C test up to this point in the test is the part that tests OceanBase’s computing performance. For this part, OceanBase was first warmed up for 10 minutes. Then, following this, it was run at the steady state for 25 minutes with a tpmC value of 60,880,000. Finally, OceanBase ran for eight hours at the collection stage of the performance stress test. The total test duration exceeded 8.5 hours, and the occurrence of the largest performance fluctuation during this period was less than 0.5%. This final result surprised the auditors.

Before and after the performance test, the auditors also need to check the random distribution of data and transactions, which involves a large number of full-table scans and SQL statistical collections. The most heavyweight SQL statement needs to access the order_line table with more than one trillion rows of items, which can be regarded as a type of TPC-H benchmark test inside of the TPC-C benchmark test. In the field audit, we also encountered many timeout errors during SQL statement execution when these SQL statements were run for the first time. Subsequently, we used the latest concurrent execution framework in OceanBase 2.2 and successfully completed the execution of these heavyweight SQL statements within a relatively short time. A really general relational database has no weaknesses. In this sense, OceanBase still needs to learn from Oracle.

Atomicity, Consistency, Isolation, and Durability (ACID)

  1. The consistency © test in the TPC-C benchmark consists of 12 cases, four of which must be tested. Actually, each case can be seen as a heavyweight and complex SQL statement. For a distributed database, it is important to ensure the global consistency all the time.
  2. In the I test, the TPC-C benchmark requires that the five transactions, except StockLevel in the TPC-C model, reach the highest isolation at the serializable level, and that nine cases need to be constructed to verify the isolation. For a distributed database, it is not easy to meet this requirement. Fortunately, OceanBase versions 2.0 and later support global timestamps. Therefore, as required by the auditors, we conducted the isolation (I) test twice in the fully local and fully distributed modes respectively. It was also the first time that the isolation (I) test had been conducted twice for a database vendor in the TPC-C test and that the I test consisted of 18 cases. In the coming future, the definition of the TPC-C benchmark may be updated for distributed databases due to the test on OceanBase.
  3. In the durability (D) test, OceanBase had a great natural advantage over traditional databases. In OceanBase, each warehouse has two data replicas and three log replicas, which are synchronized over the Paxos protocol. This ensures that the recovery time objective (RTO) is within seconds when the recovery point objective (RPO) is 0.

In the D test, the strictest test item is the permanent failure of certain storage media. The most rigorous test scenario was used for OceanBase. With the largest tpmC value of 60,000,000 that exceeded the requirement of the TPC-C benchmark, we forcibly destroyed a cloud server node. OceanBase restored the tpmC value to 60,000,000 within two minutes and continued running until the test ended. The performance shown during this test had far exceeded the requirements of the TPC-C benchmark. In comparison, other traditional databases often use Redundant Arrays of Independent Disks (RAIDs) to restore data in case of a failure of storage media with logs in the D test. OceanBase was the first database vendor that completed all the test items of the D test by only using the recovery mechanism of the database but not a RAID.

In the D test, we destroyed two server nodes consecutively. We first destroyed a data node. After OceanBase restored the tpmC value and ran at the steady state for five minutes, we destroyed the rootserver leader node. Then, OceanBase rapidly restored the tpmC value again.

Some SQL Optimizations for the TPC-C Benchmark

First of all, we will discuss the five transaction models in the TPC-C benchmark, which are order creation, order payment, order query, order delivery, and inventory query. These five transactions occur at a certain ratio, and the number of order creation transactions that are executed per minute will be finally tested. As we know, each database transaction is composed of several SQL statements associated with a certain logical structure. All of these SQL statements either succeed or fail in a transaction, something which is referred to as “atomicity” in the database, which corresponds to letter “A” in the “ACID” string. But, how much time does a transaction consume during a TPC-C benchmark test? This is clearly specified in the report, and is only dozens of milliseconds. Considering that a transaction is composed of several SQL statements, the average time consumed by each SQL statement is less than one millisecond.

In the client-server (C/S) model, a SQL statement goes through a complete process from input on a client to transmission on a network, optimization, execution, and returning the results to the client. However, an SQL statement may be executed simply to update a field, which requires a short time. When considering the entire process, lots of time is spent on interaction with the client, which results in wasted resources and increased time consumption. But how can this problem be solved? The solution is to use a stored procedure.

Stored Procedure

Figure 1. Execution comparison between the traditional C/S model and a model in which a stored procedure is used.

In TPC-C, a stored procedure is crucial for improving the execution efficiency of the entire system. OceanBase version 2.2 supports stored procedures at all stages, and has also optimized the execution efficiency of the stored procedures.

Compilation and Execution

Array Binding

During array binding, the database first finds the plan to be used, and then executes the plan. After each execution, the database performs array binding again. This is similar to a FOR loop in the C language. New values are assigned repeatedly, but no new data structure is defined. The array binding feature is controlled by users and can be triggered by the FORALL keyword during the execution of the stored procedure. In the TPC-C test, we used the array binding feature for multiple times to improve the processing capability of the system, which achieved significant performance.

Prepared Statement and Execution Plan Cache

OceanBase implements a high-speed cache to store the executable code and SQL execution plans of stored procedures. The system can rapidly obtain the execution objects for stored procedures and SQL statements with different parameters from this high-speed cache, generally within dozens of microseconds, which effectively avoids delay being in the milliseconds and CPU consumption caused by recompilation.

Updatable View

select i_price,i_name, i_data from item where i_id = ?;

UPDATE stock
SET s_order_cnt = s_order_cnt + 1,
s_ytd = s_ytd + ?,
s_remote_cnt = s_remote_cnt + ?,
s_quantity = (CASE WHEN s_quantity< ? + 10 THEN s_quantity + 91 ELSE s_quantity END) - ?
WHERE s_i_id = ?
AND s_w_id = ?
RETURNING s_quantity, s_dist_01,

However, in the case that an updatable view is created, then the following can be done.

CREATE VIEW stock_item AS
SELECT i_price, i_name, i_data, s_i_id,s_w_id, s_order_cnt, s_ytd, s_remote_cnt, s_quantity, s_data, s_dist_01
FROM stock s, item i WHERE s.s_i_id =i.i_id;

Specifically, an SQL statement can be used to update the inventory and obtain the commodity and inventory information.

UPDATE stock_item
SET s_order_cnt = s_order_cnt + 1,
s_ytd = s_ytd + ?,
s_remote_cnt = s_remote_cnt + ?,
s_quantity = (CASE WHEN s_quantity< ? + 10 THEN s_quantity + 91 ELSE s_quantity END) - ?
WHERE s_i_id = ?
AND s_w_id = ?
RETURNING i_price, i_name, s_quantity,s_dist_01,

This eliminates the need for an additional SQL statement and makes the update logic more intuitive. Users can use updatable views as common views, but cannot define all views as updatable views. For example, views that contain DISTINCT and GROUP BY cannot be updated because the semantics do not specify the rows to be updated. As for the “stock_item” view, which is essentially a join between the “stock” table and the “item” table, it is guaranteed by the uniqueness of “item.i_id” that each row from the base table “stock” corresponds to at most one row in “stock_item”, an attribute also known as “key-preserved” in some literature.

Note: The TPC-C benchmark does not allow for materialized views. However, an updatable view does not change the storage schema of the data tables at the underlying layer, and therefore meets the TPC-C benchmark.

The TPC-C benchmark requires that the design reflect the “actual” running of an OLTP system. Many of our optimizations are applicable to various applications. For example, in a highly concurrent OLTP system, most SQL requests consume a short time. If a pure C/S interaction model is used, frequent interactions between applications and the database consume lots of system time. However, stored procedures can reduce the time consumed by these interactions and enhance the network jitters immunity of the system. This core capability is indispensable for a distributed OLTP database.

In the TPC-C test, OceanBase version 2.0 started to be compatible with Oracle, in which all stored procedures and SQL statements used data types and syntax that were compatible with Oracle. In this way, we could constantly optimize the database and develop the database in the general and formal direction.

Challenges Placed on Database Transaction Engines during TPC-C Benchmark Testing

Different hardware imposes different challenges on the software architecture. Proprietary storage devices provide internal reliability by using redundant hardware. When database software uses these storage devices, no data loss occurs due to these pre-existent properties. However, proprietary storage devices are expensive, which leads to high costs.

OceanBase uses common ECS servers to provide database services and only uses the local hard disks delivered with the ECS servers for data storage, which is the most common hardware usage condition. However, individual ECS servers are less reliable than proprietary storage devices, which brings huge challenges on the software architecture. The transaction engine of OceanBase was also challenged because OceanBase achieved the ACID characteristics on common ECS servers.

The TPC-C benchmark has complete and strict requirements for the ACID characteristics of transactions. The following section introduces how OceanBase has implemented the ACID characteristics of transactions.

Ensure the Durability by Synchronizing Paxos Logs

OceanBase uses Paxos to coordinate the durability of Redo logs on the three database servers. Paxos is a consistency synchronization protocol that uses an algorithm that considers success if more than half (the majority) members succeed. When three replicas are available, two successful replicas will be considered as the majority. After Redo logs are synchronized between database servers to ensure the durability, the transactions can be committed. Generally, the third database server can also complete the synchronization of Redo logs to ensure the durability immediately. However, even if the third database server encounters an exception, it does not affect the committing of the transactions. The system will automatically synchronize the missed Redo logs with this database server after it recovers. If the third database server encounters a permanent fault, the system distributes the Redo logs to be originally synchronized by the fault database server to other database servers in the cluster. These database servers will automatically supplement the missed Redo logs and write the latest Redo logs.

The Paxos protocol implements the best balance between data durability and database service availability. When three replicas are used, any unavailable replica does not affect subsequent writes because the other two replicas can provide data and support subsequent writes.

Therefore, OceanBase can also improve the continuous service capability of the database while ensuring the durability of transactions. When the TPC auditors audited the durability of OceanBase on the field, they selected a random server and forcibly powered off the server under constant pressure from the client. From the audit results, they found that the database did not lose any data and could continue to provide services without manual intervention. This impressed the auditors, who highly praised OceanBase.

Ensuring Atomicity by Using the Two-phase Commit Protocol

OceanBase distributes data to multiple servers by warehouse ID (Warehouse_ID). If all transactions occur within the same warehouse, all transaction modifications only involve the data of one server, and all transactions will be committed on this server regardless of the data volume. This scenario demonstrates the ideal linear scalability. However, this is different from the actual business scenario. Actually, most actual businesses require data interactions in different dimensions. The TPC-C benchmark has also considered this aspect and put forward requirements for random rules of transaction data. It requires that the finally generated transactions include 10% of “order creation” transactions and 15% of “order payment” transactions that involve operations of two or more warehouses. In OceanBase, transactions across servers will be generated, whose atomicity must be ensured by the two-phase commit protocol.

OceanBase automatically traces the operation data of all SQL statements in a transaction, and automatically determines the participants of the two-phase committing based on the actual data modification location. Before a transaction is committed, OceanBase automatically selects the first participant as the coordinator. The coordinator sends a Prepare message to all participants. Each participant then needs to write their own Redo and Prepare logs, which means that each participant completes Paxos synchronization on their own. After the coordinator confirms that all the participants have completed the synchronization of the Redo and Prepare logs, the coordinator sends a Commit message to all the participants and waits until all participants complete the committing. OceanBase automatically completes the entire protocol during the committing of transactions, which is fully transparent to users. OceanBase automatically selects a coordinator for each two-phase commit transaction, and any server in the system can serve as a coordinator. Therefore, OceanBase can linearly scale out the transaction processing capability.

Ensuring the Isolation of Transactions by Controlling Multi-version Concurrency

OceanBase has three replicas of the global timestamp generator, which can be independently deployed on three servers. In addition, one replica can be deployed on the root node to share resources with the root node, as it was deployed in the TPC-C test. The three replicas of the global timestamp are deployed in a high-availability architecture. Reading any timestamp will be confirmed on at least two of the three servers. Therefore, upon the failure of any server, reading the timestamp is not affected.

According to the TPC-C benchmark, OceanBase prepared nine different scenarios for testing the isolation of transactions, including the read-and-read scenario and the read-and-write conflict scenario. Finally, OceanBase successfully passed the audit by the auditors.

Ensure the Consistency

Duplicate Table

The TPC-C benchmark does not limit the number of data replicas, but requires that the ACID characteristics of the transactions be guaranteed regardless of the data formats. OceanBase uses a special broadcast protocol to guarantee the ACID characteristics of the duplicate table in all replicas. When the duplicate table is modified, all replicas will also be modified. In addition, when a server fails, the logic of the duplicate table automatically removes the invalid replica so that there is no unnecessary waiting time caused by the server failure during data modification. Duplicate tables can be used in many business scenarios, such as dictionary tables that store key information in various businesses, and tables that store exchange rate information in financial businesses.

4. Storage Optimization for TPC-C Benchmark Testing

Assume that the tpmC value of a system is 1.5 million, which corresponds to 120,000 warehouses. If the data volume of a warehouse is 70 MB, the total data volume will be 8.4 TB. Some vendors use a modified TPC-C test that does not meet audit requirements. This test does not limit the maximum value of tpmC for a single warehouse. In addition, these vendors test the performance of loading hundreds or thousands of warehouses to the memory, which is meaningless and does not help pass the audit. In an actual TPC-C test, the test on storage consumption accounts for a great portion. OceanBase is the first database to have ranked first in the TPC-C test by using the Shared Nothing architecture and the log-structure merge-tree (LSM tree) storage engine architecture. The storage architecture of OceanBase has the following key features:

  1. To ensure reliability, OceanBase stores data in two replicas and logs in three replicas. In the TPC-C test, traditional centralized databases only stored data in one replica.
  2. In the TPC-C test, OceanBase used two data replicas and Alibaba Cloud ecs.i2.16xlarge servers that were the same as those in production systems. Therefore, the storage capacity of solid-state drives (SSDs) became the bottleneck. OceanBase resolved this problem by using online compression, which further increased the utilization rate of the CPU. Accordingly, data was stored in one replica in the test for centralized databases and data compression may not be opened.
  3. The LSM engine of OceanBase needs to regularly perform compaction at the background, but the TPC-C benchmark requires that the database run for at least eight hours with jitters less than 2% in two hours. Therefore, OceanBase needs to resolve the jitters caused by the background operations of the LSM engine.

Two Data Replicas

The primary issue caused by three data replicas is the increased storage cost. Most traditional business databases used disk arrays in the TPC-C test, and the TPC-C benchmark has no requirements for disaster recovery. Therefore, it is not allowed to use storage space that is three times the storage space used by traditional databases in the TPC-C test.

Based on this fact, we only use the Paxos protocol to synchronize logs. Logs will be written into three replicas, but data will only be written in two replicas for disaster recovery in case of single server failure. When a server fails and invalidates one replica, the other replica can supplement the missed data by using logs and continue to provide external access.

Compared with data, logs consume smaller storage space. We defined three types of replicas to separate data from logs: The F replica stores data, synchronizes logs, and provides external reads and writes. The D replica stores data, synchronizes logs, but does not provide external reads and writes. The L replica only synchronizes logs, but does not store data. When the F replica becomes faulty, the D replica takes over the F replica to provide external reads and writes after supplementing the missed data. In the TPC-C test, we deployed the F, D, and L replicas, which used the storage space required by two data replicas. Both the D and L replicas need to play back logs, and the D replica also needs to synchronize data. All these operations consume the network and CPU resources.

Online Compression

To resolve this problem, OceanBase performed online compression for the data in the TPC-C test. In Oracle, the storage capacity of a warehouse is about 70 MB. However, in OceanBase, the storage capacity of a compressed warehouse is only about 50 MB, which is a significant reduction in storage space. According to the TPC-C benchmark, a database must provide disk space for storing the data of 60 days. However, OceanBase must store data in two replicas, which is equivalent to the data of 120 days. Although this provides higher reliability, the higher storage cost also leads to a higher total price.

OceanBase uses 204 ECS cs.i2.16xlarge servers to store data. The server type is the same as that for actual online business applications. The size of the log disk on each server is 1 TB, and the data disk size is nearly 13 TB. We calculated the storage space required by the two compressed data replicas in 60 days, and found that the data disk of each server was almost exhausted. According to the resource cost consumed by the servers, a balance had been reached. After the single-server performance in tpmC of OceanBase is further improved, the disk capacity will be a bottleneck. The OceanBase LSM engine is append-only, which does not allow for random modification and supports online compression. Both for the TPC-C test and core OLTP production systems such as Alipay’s transaction payment system, OceanBase will enable online compression to exchange storage space with the CPU capability.

Smooth Storage Performance

Layered Dump

Resource Isolation

CPU Usage by the Storage Module

  1. In a centralized architecture, the database needs to use the CPU, and proprietary storage devices also need to use the CPU. For example, Oracle ranked second in the TPC-C tests. In the test on the tpmC value of more than 30 million, the database used 108 SPARC T3 processors and involved 1,728 physical cores and 13,824 execution threads. In addition, the storage module used an Intel-core server as the controller with additional 97 servers, 194 Intel X5670 CPUs, and 2,328 physical cores.
  2. A centralized database uses hardware with high reliability, and therefore only needs one replica for storage. However, OceanBase implements fault tolerance by using software. Although the hardware cost is low, maintaining two data replicas and three log replicas uses lots of CPU resources.
  3. OceanBase enabled online compression in the TPC-C test and production systems, which further increased the CPU usage.

Therefore, it is not rational to simply compare the CPU cores used by OceanBase and Oracle. Instead, the CPU cores shared by storage devices, and the CPU overhead caused by multi-replica storage and online compression of OceanBase must also be taken into account. The scheme recommended by TPC-C does not emphasize to specific software and hardware architecture, but focuses on the total hardware cost. In the test on OceanBase, the hardware cost accounted for about 18% of the total cost. Therefore, the cost-effectiveness of OceanBase is much higher than centralized databases when hardware is the only consideration.

Future Development

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.