Multi-Document Transactions on MongoDB 4.0

At MongoDB World at the end of last month, MongoDB announced the release of version 4.0, which supports multi-document transactions on replica sets. Alibaba Cloud ApsaraDB Database R&D engineers promptly analyzed the source code of the transaction feature and worked out its implementation mechanism. In this article, we will discuss how multi-document transactions are implemented in MongoDB 4.0.

What’s New on MongoDB 4.0

MongoDB 4.0 introduces multi-document transactions with full ACID semantics. For example, you can run transaction operations from the mongo shell:

> s = db.getMongo().startSession()
session { "id" : UUID("3bf55e90-5e88-44aa-a59e-a30f777f1d89") }
> s.startTransaction()
> db.coll01.insert({x: 1, y: 1})
WriteResult({ "nInserted" : 1 })
> db.coll02.insert({x: 1, y: 1})
WriteResult({ "nInserted" : 1 })
> s.commitTransaction() (or s.abortTransaction() to roll back the transaction)

Other language drivers that support MongoDB 4.0 also encapsulate transaction-related APIs: the user creates a Session, then starts and commits the transaction on that Session. For example:

Python Version

with client.start_session() as s:
    with s.start_transaction():
        collection_one.insert_one(doc_one, session=s)
        collection_two.insert_one(doc_two, session=s)

Java Version

try (ClientSession clientSession = client.startSession()) {
    clientSession.startTransaction();
    collection.insertOne(clientSession, docOne);
    collection.insertOne(clientSession, docTwo);
    clientSession.commitTransaction();
}

Session is a concept introduced in MongoDB 3.6. This feature is mainly used to achieve multi-document transactions. Session is essentially a "context."

In previous versions, MongoDB only managed the context of a single operation. The mongod service process received a request, created a context for the request (corresponding to OperationContext in the source), and then used the context throughout the request. The content of the context includes request time consumption statistics, lock resources occupied by the request, storage snapshots used by the request, etc.

With Session, you can have multiple requests share a single context, and multiple requests can be correlated to support multi-document transactions.

Each Session is identified by a unique lsid. In version 4.0, each user request can carry additional extension fields, including the lsid (which Session the request belongs to), the txnNumber (the transaction number of the request, which must be monotonically increasing within a Session), and the stmtId (the identifier of the operation within the transaction).

In fact, when using transactions, users do not need to understand these details, because they are handled automatically by the MongoDB driver. The driver allocates the lsid when the Session is created and attaches it to all subsequent operations in that Session. For transaction operations, the txnNumber is automatically added as well.

It is worth mentioning that the lsid of a Session can be allocated by the server via the startSession command, or allocated by the client itself, which saves a network round trip. For identifying a transaction, MongoDB does not provide a separate startTransaction command; the txnNumber is assigned directly by the driver. The driver only needs to guarantee that, within a Session, the txnNumber is monotonically increasing. When the server receives a new transaction request, it proactively starts a new transaction.
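As a rough illustration of this driver-side bookkeeping (my own sketch, not actual driver source), the client can allocate the lsid itself and stamp each transactional command with a monotonically increasing txnNumber:

```python
import itertools
import uuid

class SessionSketch:
    """Illustrative model of driver-side session bookkeeping (not real driver code)."""

    def __init__(self):
        # The client may allocate the lsid itself, saving a startSession round trip.
        self.lsid = {"id": uuid.uuid4()}
        self._txn_counter = itertools.count(0)
        self.txn_number = None

    def start_transaction(self):
        # txnNumber only needs to be monotonically increasing within this Session.
        self.txn_number = next(self._txn_counter)

    def decorate(self, command, first_in_txn=False):
        """Attach session/transaction fields to an outgoing command document."""
        command = dict(command, lsid=self.lsid,
                       txnNumber=self.txn_number, autocommit=False)
        if first_in_txn:
            # The first command of a transaction carries startTransaction: true;
            # the server starts a new transaction when it sees a new txnNumber.
            command["startTransaction"] = True
        return command

s = SessionSketch()
s.start_transaction()
cmd = s.decorate({"insert": "coll01"}, first_in_txn=True)
```

Every later command in the same transaction carries the same lsid and txnNumber, which is how the server correlates them into one transaction.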

When calling startSession, you can specify a series of options that control the Session's access behavior, such as causalConsistency and the default read/write concern for transactions in the Session.


Atomicity, Consistency, Isolation, and Durability (ACID) is a set of properties for database transactions aimed at ensuring the validity of transactions under all circumstances. ACID plays an essential role for multi-document transactions.

Atomicity

For multi-document transaction operations, MongoDB provides an "all or nothing" atomicity guarantee. Data changes are made visible outside the transaction only if the transaction commits successfully. When a transaction fails, all of its data changes are discarded.

Consistency

Consistency is straightforward: only permissible transactions are allowed on the database, which prevents an illegal transaction from corrupting the database.

Isolation

MongoDB provides a snapshot isolation level: it creates a WiredTiger snapshot at the beginning of the transaction and uses that snapshot for all transactional reads until the transaction ends.
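The snapshot-isolation idea can be sketched in a few lines of Python (a toy model, not MongoDB internals): the transaction copies a point-in-time snapshot of the data when it begins, and all of its reads are served from that snapshot regardless of later writes.

```python
class SnapshotTxnDemo:
    """Toy model of snapshot isolation: reads come from a snapshot taken at txn start."""

    def __init__(self):
        self.data = {}          # the "committed" store

    def begin(self):
        return dict(self.data)  # snapshot: a point-in-time copy of committed data

    def write(self, key, value):
        self.data[key] = value  # a committed write outside the transaction

db = SnapshotTxnDemo()
db.write("x", 1)
snap = db.begin()            # transaction starts: snapshot taken here
db.write("x", 2)             # concurrent write lands after the snapshot
in_txn_read = snap.get("x")  # the transaction still sees the value 1
```

A real engine keeps version chains rather than full copies, but the visibility behavior is the same: the transaction never observes writes committed after its snapshot.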

Durability

When a transaction uses WriteConcern {j: true}, MongoDB guarantees that the commit returns only after the transaction log has been persisted, so even if a crash occurs, MongoDB can recover the transaction from its log. If {j: true} is not specified, then even after a transaction commits successfully, it may be rolled back after crash recovery.

Transactions and Replication

In a replica set configuration, a single oplog entry (a normal document, which is why in the current version a transaction's modifications cannot exceed the 16 MB document size limit) is recorded when the MongoDB transaction commits; it contains all the operations in the transaction. The slave node pulls this oplog and replays the transaction operations locally.

Below is an example of a transaction oplog entry. It contains the lsid and txnNumber of the transaction operation, and all the operation logs within the transaction (the applyOps field):

"ts" : Timestamp(1530696933, 1), "t" : NumberLong(1), "h" : NumberLong("4217817601701821530"), "v" : 2, "op" : "c", "ns" : "admin.$cmd", "wall" : ISODate("2018-07-04T09:35:33.549Z"), "lsid" : { "id" : UUID("e675c046-d70b-44c2-ad8d-3f34f2019a7e"), "uid" : BinData(0,"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=") }, "txnNumber" : NumberLong(0), "stmtId" : 0, "prevOpTime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) }, "o" : { "applyOps" : [ { "op" : "i", "ns" : "test.coll2", "ui" : UUID("a49ccd80-6cfc-4896-9740-c5bff41e7cce"), "o" : { "_id" : ObjectId("5b3c94d4624d615ede6097ae"), "x" : 20000 } }, { "op" : "i", "ns" : "test.coll3", "ui" : UUID("31d7ae62-fe78-44f5-ba06-595ae3b871fc"), "o" : { "_id" : ObjectId("5b3c94d9624d615ede6097af"), "x" : 20000 } } ] } }

When replaying, the slave node parses the operations out of the applyOps field and applies them locally as a single unit, so the replay of the whole transaction is itself atomic.
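A simplified model of replaying such a transaction oplog entry (my own sketch, not MongoDB source) makes the all-or-nothing behavior concrete: the operations inside applyOps are staged together and only published if every one of them succeeds.

```python
def replay_transaction_oplog(store, oplog_entry):
    """Apply every op inside an applyOps oplog entry atomically (all or nothing).

    store maps a namespace ("db.coll") to a dict of {_id: document}.
    """
    # Work on a copy so a mid-replay failure leaves the store untouched.
    staged = {ns: dict(docs) for ns, docs in store.items()}
    for op in oplog_entry["o"]["applyOps"]:
        docs = staged.setdefault(op["ns"], {})
        if op["op"] == "i":                 # insert
            docs[op["o"]["_id"]] = op["o"]
        elif op["op"] == "d":               # delete
            del docs[op["o"]["_id"]]
        else:
            raise ValueError("unsupported op: %r" % op["op"])
    store.clear()
    store.update(staged)                    # publish all changes at once

store = {}
entry = {"o": {"applyOps": [
    {"op": "i", "ns": "test.coll2", "o": {"_id": 1, "x": 20000}},
    {"op": "i", "ns": "test.coll3", "o": {"_id": 2, "x": 20000}},
]}}
replay_transaction_oplog(store, entry)
```

The real replay path handles many more op types and runs inside a storage-engine transaction, but the shape is the same: parse applyOps, apply each operation, commit as a unit.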

Transactions and the Storage Engine

Unified Transaction Timing

WiredTiger has supported transactions for a long time; in the 3.x versions, MongoDB already used WiredTiger transactions to guarantee atomic modification of data, indexes, and the oplog. Yet it took multiple major versions before MongoDB could expose a transaction API, and the core difficulty was timing.

MongoDB uses oplog timestamps to establish the global order, while WiredTiger uses internal transaction IDs. In the implementation there was no association between the two, so the transaction commit order MongoDB saw could be inconsistent with the commit order WiredTiger saw.

To solve this problem, WiredTiger 3.0 introduces a transaction timestamp mechanism. The application can explicitly assign a commit timestamp to a WiredTiger transaction through the WT_SESSION::timestamp_transaction API, and can then read "as of" a specified timestamp. With the read "as of" a timestamp feature, reads on a slave node no longer conflict with oplog replay, so read requests are not blocked while the oplog is being replayed. This is a significant improvement in version 4.0.

/*
 * __wt_txn_visible --
 *     Can the current transaction see the given ID / timestamp?
 */
static inline bool
__wt_txn_visible(
    WT_SESSION_IMPL *session, uint64_t id, const wt_timestamp_t *timestamp)
{
    if (!__txn_visible_id(session, id))
        return (false);

    /* Transactions read their writes, regardless of timestamps. */
    if (F_ISSET(&session->txn, WT_TXN_HAS_ID) && id == session->txn.id)
        return (true);

#ifdef HAVE_TIMESTAMPS
    {
    WT_TXN *txn = &session->txn;

    /* Timestamp check. */
    if (!F_ISSET(txn, WT_TXN_HAS_TS_READ) || timestamp == NULL)
        return (true);

    return (__wt_timestamp_cmp(timestamp, &txn->read_timestamp) <= 0);
    }
#else
    WT_UNUSED(timestamp);
    return (true);
#endif
}

As the code above shows, after transaction timestamps are introduced, the timestamp is checked in addition to the transaction ID when determining visibility. When the upper layer specifies a read timestamp, only data committed at or before that timestamp is visible. When committing a transaction, MongoDB associates the oplog timestamp with the WiredTiger transaction, so that the ordering at the MongoDB Server layer is consistent with the ordering at the WiredTiger layer.
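The same visibility rule can be re-expressed in Python for clarity (a paraphrase of the C code above, not WiredTiger source): the transaction ID is checked first, then the read timestamp.

```python
def txn_visible(txn, update_txn_id, update_timestamp):
    """Paraphrase of __wt_txn_visible: ID visibility first, then the timestamp check.

    txn is a dict with 'visible_ids' (IDs the snapshot can see), optional 'own_id'
    (this transaction's ID), and optional 'read_timestamp'.
    """
    # 1. The update's transaction ID must be visible to this snapshot.
    if update_txn_id not in txn["visible_ids"]:
        return False
    # 2. Transactions always read their own writes, regardless of timestamps.
    if update_txn_id == txn.get("own_id"):
        return True
    # 3. With a read timestamp set, only updates at or before it are visible.
    read_ts = txn.get("read_timestamp")
    if read_ts is None or update_timestamp is None:
        return True
    return update_timestamp <= read_ts
```

Step 3 is what makes read "as of" a timestamp work: a replaying oplog entry with a later commit timestamp is simply invisible to the reader.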

Impact of Transaction on the Cache

A WiredTiger (WT) transaction opens a snapshot, and the presence of snapshots affects WiredTiger cache eviction. If a WT page carries N modification versions and those modifications are not yet globally visible (__wt_txn_visible_all), the page cannot be evicted (__wt_page_can_evict).
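This eviction rule can be sketched as follows (a toy model; the real WT logic is considerably more involved): a page is evictable only when every modification on it is visible to all active snapshots, i.e. strictly older than anything a live snapshot might still need.

```python
def all_visible(update_ids, active_snapshot_min_ids):
    """An update is globally visible once every open snapshot can see it,
    i.e. its txn id is below the minimum id any active snapshot still tracks."""
    oldest = min(active_snapshot_min_ids) if active_snapshot_min_ids else float("inf")
    return all(uid < oldest for uid in update_ids)

def page_can_evict(page_update_ids, active_snapshot_min_ids):
    # Mirrors the shape of __wt_page_can_evict: evict only when every
    # modification version on the page is globally visible.
    return all_visible(page_update_ids, active_snapshot_min_ids)
```

A long-lived transaction keeps its snapshot's minimum ID low, which pins every later modification in cache; this is exactly the pressure described below.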

In the 3.x versions, the modifications a write request made to data, indexes, and the oplog were placed in one WT transaction whose commit was controlled by MongoDB, and MongoDB committed the transaction as soon as possible to complete the write request. After transactions were introduced in version 4.0, however, the commit is controlled by the application: a transaction may contain many modifications and may not commit for a long time. This puts great pressure on WT cache eviction; if a large amount of memory cannot be evicted, the cache eventually becomes stuck.

To minimize the pressure on the WT cache, MongoDB 4.0 places some restrictions on transactions and automatically aborts a transaction that exceeds certain resource thresholds, releasing its resources. For example, a transaction's modifications must fit in a single 16 MB oplog entry, and by default a transaction that runs longer than 60 seconds (configurable via the transactionLifetimeLimitSeconds parameter) is automatically aborted.
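The lifetime threshold is simple to model (an illustrative sketch; the real server enforces this via the transactionLifetimeLimitSeconds parameter, 60 seconds by default):

```python
TXN_LIFETIME_LIMIT_SECONDS = 60  # default of transactionLifetimeLimitSeconds

def should_auto_abort(txn_start_time, now, limit=TXN_LIFETIME_LIMIT_SECONDS):
    """A transaction running past the limit is aborted to release the snapshot
    and cache resources it is pinning."""
    return (now - txn_start_time) > limit
```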

Read as of a Timestamp and Oldest Timestamp

Read "as of" a timestamp relies on WiredTiger maintaining multiple versions in memory, each associated with a timestamp. As long as the MongoDB layer might still need to read a version, the engine layer must keep that version's resources. If too many versions are retained, the WT cache comes under heavy pressure.

To bound this history, WiredTiger lets MongoDB set an oldest timestamp: MongoDB promises never to issue a read "as of" a timestamp smaller than the oldest timestamp, so WiredTiger does not need to maintain any version history from before it. The MongoDB layer must advance the oldest timestamp frequently and in a timely manner to avoid putting too much pressure on the WT cache.
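A toy model of how the oldest timestamp bounds the version history the engine must keep (my own sketch, not WT internals):

```python
def read_as_of(versions, ts):
    """versions: list of (commit_ts, value) sorted by ascending commit_ts.
    Return the newest value committed at or before ts."""
    result = None
    for commit_ts, value in versions:
        if commit_ts <= ts:
            result = value
    return result

def set_oldest_timestamp(versions, oldest_ts):
    """Discard history no reader can need: keep the newest version at or before
    oldest_ts (still readable "as of" oldest_ts) plus everything after it."""
    keep_from = 0
    for i, (commit_ts, _) in enumerate(versions):
        if commit_ts <= oldest_ts:
            keep_from = i
    return versions[keep_from:]

versions = [(10, "a"), (20, "b"), (30, "c")]
pruned = set_oldest_timestamp(versions, 25)  # (10, "a") can be dropped
```

Any read "as of" a timestamp at or after the oldest timestamp still returns the same answer after pruning, which is exactly the contract MongoDB relies on.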

Engine Layer Rollback and Stable Timestamp

In the 3.x versions, the rollback of a MongoDB replica set node is done at the Server layer: when a node needs to roll back, it either continuously applies inverse operations according to the oplog entries being rolled back, or reads the latest version of each affected document from the rollback source. The entire rollback operation is inefficient.

Version 4.0 implements rollback at the storage engine layer. When a replica set node needs to roll back, it directly calls a WiredTiger API to roll the data back to a stable version (in effect, a checkpoint). This stable version is determined by the stable timestamp: WiredTiger ensures that data committed after the stable timestamp is never written to a checkpoint. Based on the replication state of the replica set, MongoDB advances the stable timestamp once data has been synchronized to a majority of nodes (majority committed); majority-committed data will never be rolled back, so everything at or before that timestamp can safely be written to the checkpoint.
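The engine-level rollback can be sketched like this (illustrative only): since the checkpoint never contains data newer than the stable timestamp, rolling back amounts to discarding every update after it.

```python
def rollback_to_stable(updates, stable_ts):
    """Sketch of engine-level rollback: drop all updates committed after the
    stable timestamp. updates is a list of (commit_ts, value) pairs."""
    return [(ts, value) for ts, value in updates if ts <= stable_ts]

# The stable timestamp tracks the majority-committed point of the replica set:
# data at or before it has replicated to a majority and will never be rolled back.
updates = [(10, "a"), (20, "b"), (30, "c")]
after_rollback = rollback_to_stable(updates, 20)  # (30, "c") is discarded
```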

MongoDB must also advance the stable timestamp frequently and in a timely manner; otherwise WT checkpointing is affected and a large amount of memory cannot be released.

Distributed Transactions

MongoDB 4.0 supports multi-document transactions on replica sets and plans to support transactions on sharded clusters in version 4.2. The following is a feature timeline from MongoDB 3.0, which introduced WiredTiger, to 4.0, which supports multi-document transactions.

ApsaraDB for MongoDB

If you are planning on using MongoDB 4.0 for your enterprise, you should consider using ApsaraDB for MongoDB. Alibaba Cloud ApsaraDB for MongoDB is based on the Apsara distributed file system with high-performance storage. It features a three-node replica set, high-availability structure, DR switching, and completely transparent fault migration. It also provides solutions such as professional online database resizing, backup and rollback, and performance monitoring and analysis.