Multi-Document Transactions on MongoDB 4.0

At MongoDB World at the end of last month, MongoDB announced the release of MongoDB 4.0, which supports multi-document transactions on replica sets. Alibaba Cloud ApsaraDB database R&D engineers promptly analyzed the source code of the transaction feature and worked out its implementation mechanism. In this article, we discuss how multi-document transactions are implemented in MongoDB 4.0.

What’s New in MongoDB 4.0

MongoDB 4.0 introduces multi-document transactions with ACID guarantees. For example, you can run transactional operations from the mongo shell:

> s = db.getMongo().startSession()
session { "id" : UUID("3bf55e90-5e88-44aa-a59e-a30f777f1d89") }
> s.startTransaction()
> s.getDatabase("test").coll01.insert({x: 1, y: 1})
WriteResult({ "nInserted" : 1 })
> s.getDatabase("test").coll02.insert({x: 1, y: 1})
WriteResult({ "nInserted" : 1 })
> s.commitTransaction() // or s.abortTransaction() to roll back the transaction

Python Version

with client.start_session() as s:
    s.start_transaction()
    collection_one.insert_one(doc_one, session=s)
    collection_two.insert_one(doc_two, session=s)
    s.commit_transaction()

Java Version

try (ClientSession clientSession = client.startSession()) {
    clientSession.startTransaction();
    collection.insertOne(clientSession, docOne);
    collection.insertOne(clientSession, docTwo);
    clientSession.commitTransaction();
}

Session

Session is a concept introduced in MongoDB 3.6 and is the foundation of multi-document transactions. A Session is essentially a "context" maintained by the server across requests. Within a transaction, each request carries:

  1. txnNumber: the transaction number of the request, which must increase monotonically within a Session.
  2. stmtIds: the IDs of the individual operations in the request (for example, a single insert command can insert multiple documents, each with its own stmtId).
A Session also lets the client configure several request-level policies (a driver-level sketch follows this list):

  1. writeConcern: MongoDB lets the client flexibly configure the write policy (writeConcern) to meet the needs of different scenarios.
  2. readConcern: introduced in version 3.2 as the read-side counterpart of writeConcern, allowing the read policy to be customized flexibly.
  3. readPreference: rules for selecting which node to read from; see read preference.
  4. retryWrites: if set to true, MongoDB automatically retries writes interrupted by a primary re-election in a replica set; see retryable writes.
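As a sketch of how a driver sets these options, the following uses PyMongo (3.7+); the connection string, replica set name rs0, and database/collection names are assumptions:

from pymongo import MongoClient, ReadPreference
from pymongo.client_session import TransactionOptions
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

# retryWrites enables automatic retry of writes interrupted by a re-election.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0",
                     retryWrites=True)

# Default policies for every transaction started on this session.
opts = TransactionOptions(
    read_concern=ReadConcern("snapshot"),
    write_concern=WriteConcern(w="majority", j=True),
    read_preference=ReadPreference.PRIMARY,
)

with client.start_session(default_transaction_options=opts) as s:
    s.start_transaction()
    client.test.coll01.insert_one({"x": 1}, session=s)
    s.commit_transaction()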

ACID

Atomicity, Consistency, Isolation, and Durability (ACID) is a set of properties for database transactions aimed at ensuring the validity of transactions under all circumstances. ACID plays an essential role in multi-document transactions.

Atomicity

For multi-document operations, MongoDB provides an "all or nothing" atomicity guarantee. Data changes are made visible outside the transaction only if the transaction succeeds. When a transaction fails, all of its data changes are discarded.
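A minimal PyMongo sketch of this behavior, reusing the client and collections from the earlier example and assuming they start empty:

# Aborting the transaction discards both inserts; nothing becomes visible.
with client.start_session() as s:
    s.start_transaction()
    collection_one.insert_one({"x": 1, "y": 1}, session=s)
    collection_two.insert_one({"x": 1, "y": 1}, session=s)
    s.abort_transaction()

assert collection_one.count_documents({"x": 1}) == 0  # nothing was written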

Consistency

Consistency is straightforward: only transactions that leave the database in a valid state are permitted, which prevents an illegal transaction from corrupting the database.

Isolation

MongoDB provides snapshot isolation: a WiredTiger snapshot is created at the beginning of the transaction, and this snapshot serves all reads for the duration of the transaction.
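A PyMongo sketch of what this implies: a write committed outside the transaction after its snapshot is taken stays invisible inside it (the client and collection names are assumptions carried over from the earlier examples):

from pymongo.read_concern import ReadConcern

with client.start_session() as s:
    s.start_transaction(read_concern=ReadConcern("snapshot"))
    collection_one.find_one({"x": 42}, session=s)   # snapshot established
    collection_one.insert_one({"x": 42})            # committed outside
    # The transaction still reads from its snapshot, so the new document
    # is not visible here.
    assert collection_one.find_one({"x": 42}, session=s) is None
    s.commit_transaction()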

Durability

When a transaction uses writeConcern {j: true}, MongoDB guarantees that the commit returns only after the transaction log (journal) has been persisted, so even if a crash occurs, MongoDB can recover from the journal. If {j: true} is not specified, a transaction that committed successfully may still be rolled back after crash recovery.
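Expressed with PyMongo as a sketch (client and collection assumed as before):

from pymongo.write_concern import WriteConcern

with client.start_session() as s:
    # The commit waits until the journal is persisted before returning.
    s.start_transaction(write_concern=WriteConcern(j=True))
    collection_one.insert_one({"x": 1}, session=s)
    s.commit_transaction()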

Transaction and Replica

In a replica set configuration, MongoDB writes a single oplog entry when the entire transaction commits, containing all the operations in the transaction. (The oplog entry is a normal document, so in the current version the modifications of one transaction cannot exceed the 16 MB document size limit.) The secondary nodes pull the oplog and replay the transaction operations locally. An example oplog entry for a transaction:

"ts" : Timestamp(1530696933, 1), "t" : NumberLong(1), "h" : NumberLong("4217817601701821530"), "v" : 2, "op" : "c", "ns" : "admin.$cmd", "wall" : ISODate("2018-07-04T09:35:33.549Z"), "lsid" : { "id" : UUID("e675c046-d70b-44c2-ad8d-3f34f2019a7e"), "uid" : BinData(0,"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=") }, "txnNumber" : NumberLong(0), "stmtId" : 0, "prevOpTime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) }, "o" : { "applyOps" : [ { "op" : "i", "ns" : "test.coll2", "ui" : UUID("a49ccd80-6cfc-4896-9740-c5bff41e7cce"), "o" : { "_id" : ObjectId("5b3c94d4624d615ede6097ae"), "x" : 20000 } }, { "op" : "i", "ns" : "test.coll3", "ui" : UUID("31d7ae62-fe78-44f5-ba06-595ae3b871fc"), "o" : { "_id" : ObjectId("5b3c94d9624d615ede6097af"), "x" : 20000 } } ] } }
On the secondary, a batch of oplog entries is applied as follows (a toy sketch follows the list):

  1. Set the oplogTruncateAfterPoint timestamp to the timestamp of the first oplog entry in the batch (stored in the collection local.replset.oplogTruncateAfterPoint).
  2. Write all the oplog entries in the batch to the collection local.oplog.rs; if the batch contains many entries, the writes are parallelized for speed.
  3. Clear oplogTruncateAfterPoint, marking the oplog batch as completely written. If a crash occurs before this step completes, recovery after restart finds oplogTruncateAfterPoint set and truncates the oplog back to that timestamp to restore a consistent state.
  4. Divide the oplog entries among multiple threads for concurrent replay. To keep this concurrency efficient, the oplog entry produced by a transaction contains all of its modifications and, like ordinary single-operation entries, is split across threads by document ID.
  5. Update the ApplyThrough timestamp to the timestamp of the last oplog entry in the batch, marking the point from which synchronization resumes after the next restart. If a failure occurs before this step, the oplog is pulled again starting from the previous value of ApplyThrough (the last entry of the previous batch).
  6. Update the oplog visible timestamp, so that nodes syncing from this secondary can read the newly written entries.
  7. Update the local snapshot (timestamp), making the new writes visible to readers on this node.
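Here is a toy, in-memory Python sketch of steps 1-7 and the crash-recovery rule they imply; every name is illustrative, none of this is MongoDB internals:

state = {"truncate_after": None, "apply_through": None, "oplog": []}

def apply_batch(batch, apply_op):
    state["truncate_after"] = batch[0]["ts"]        # step 1: crash marker
    state["oplog"].extend(batch)                    # step 2: write the batch
    state["truncate_after"] = None                  # step 3: batch complete
    groups = {}                                     # step 4: split by doc id
    for op in batch:
        groups.setdefault(op["id"], []).append(op)
    for ops in groups.values():                     # one thread per group
        for op in ops:
            apply_op(op)
    state["apply_through"] = batch[-1]["ts"]        # step 5: restart point
    # steps 6-7: advance the visible timestamp and the local snapshot
    state["visible_ts"] = state["snapshot_ts"] = batch[-1]["ts"]

def recover():
    # A crash between steps 1 and 3 leaves truncate_after set: drop the
    # partially written batch to restore a consistent oplog.
    if state["truncate_after"] is not None:
        state["oplog"] = [e for e in state["oplog"]
                          if e["ts"] < state["truncate_after"]]
        state["truncate_after"] = None
    return state["apply_through"]                   # resume pulling here

apply_batch([{"ts": 1, "id": 7}, {"ts": 2, "id": 9}], apply_op=lambda op: None)
assert recover() == 2   # nothing to truncate; resume after the batch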

Transaction and Storage Engine

Unified Transaction Timing

WiredTiger has supported transactions for a long time; in the 3.x versions, MongoDB already used WiredTiger transactions to guarantee the atomicity of modifications to data, indexes, and the oplog. Yet MongoDB exposed a transaction API only after iterating over multiple versions, because the core difficulty is unifying transaction timing. MongoDB 4.0 addresses this by introducing timestamps into WiredTiger's visibility check, as __wt_txn_visible shows below: a version is visible only if its transaction ID is visible and its timestamp does not exceed the transaction's read timestamp.

/*
 * __wt_txn_visible --
 *     Can the current transaction see the given ID / timestamp?
 */
static inline bool
__wt_txn_visible(
    WT_SESSION_IMPL *session, uint64_t id, const wt_timestamp_t *timestamp)
{
    if (!__txn_visible_id(session, id))
        return (false);

    /* Transactions read their writes, regardless of timestamps. */
    if (F_ISSET(&session->txn, WT_TXN_HAS_ID) && id == session->txn.id)
        return (true);

#ifdef HAVE_TIMESTAMPS
    {
    WT_TXN *txn = &session->txn;

    /* Timestamp check. */
    if (!F_ISSET(txn, WT_TXN_HAS_TS_READ) || timestamp == NULL)
        return (true);

    return (__wt_timestamp_cmp(timestamp, &txn->read_timestamp) <= 0);
    }
#else
    WT_UNUSED(timestamp);
    return (true);
#endif
}

Impact of Transaction on the Cache

A WiredTiger (WT) transaction opens a snapshot, and the presence of old snapshots affects WiredTiger cache eviction. If a WT page carries N modified versions and these modifications are not globally visible (__wt_txn_visible_all), the page cannot be evicted (__wt_page_can_evict). To limit this cache pressure, MongoDB 4.0 places two restrictions on transactions:

  1. A transaction may modify at most 1,000 documents; this limit is not configurable.
  2. The oplog entry generated by a transaction's modifications cannot exceed 16 MB. The oplog entry is a normal MongoDB document and must comply with the document size limit.

Read as of a Timestamp and Oldest Timestamp

Read as of a timestamp relies on WiredTiger maintaining multiple versions in memory, each associated with a timestamp. As long as the MongoDB layer may still need to read a version, the engine layer must retain the resources for it; retaining too many versions puts great pressure on the WT cache. For this reason MongoDB advances an oldest timestamp: reads at timestamps older than it are no longer allowed, so the engine may discard the versions that only such reads could see.
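To make the idea concrete, here is a minimal, self-contained Python sketch of version chains read "as of" a timestamp, with an oldest timestamp that bounds how much history must be kept. It only illustrates the technique; it is not MongoDB or WiredTiger code:

from bisect import bisect_right

class VersionStore:
    """Toy MVCC store: one sorted (timestamp, value) chain per key."""

    def __init__(self):
        self.versions = {}

    def write(self, key, ts, value):
        # Assumes writes arrive in timestamp order for each key.
        self.versions.setdefault(key, []).append((ts, value))

    def read_as_of(self, key, ts):
        # Newest version whose timestamp is <= ts.
        chain = self.versions.get(key, [])
        i = bisect_right([t for t, _ in chain], ts)
        return chain[i - 1][1] if i else None

    def set_oldest(self, oldest_ts):
        # Reads older than oldest_ts are no longer allowed, so all but
        # the newest version at or before it can be discarded. This is
        # what bounds the memory the engine must retain.
        for key, chain in self.versions.items():
            i = bisect_right([t for t, _ in chain], oldest_ts)
            self.versions[key] = chain[max(i - 1, 0):]

store = VersionStore()
store.write("a", 10, "v1")
store.write("a", 20, "v2")
assert store.read_as_of("a", 15) == "v1"  # reads the version as of ts 15
store.set_oldest(20)                      # the ts-10 version can be dropped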

Engine Layer Rollback and Stable Timestamp

In the 3.x versions, the rollback of a MongoDB replica set node is done at the Server layer: when a node needs to roll back, it either continuously applies reverse operations derived from the oplog entries being rolled back, or re-reads the latest versions from the rollback source node. The entire operation is inefficient. In 4.0, the engine layer maintains a stable timestamp (a point guaranteed not to be rolled back); a node that must roll back can simply ask the engine to revert to that timestamp instead of undoing operations one by one.
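Continuing the VersionStore sketch above, engine-layer rollback becomes a one-pass truncation of everything newer than the stable timestamp (illustrative only):

def rollback_to_stable(store, stable_ts):
    # Discard every version newer than the stable timestamp; what
    # remains is the state as of that timestamp.
    for key, chain in store.versions.items():
        store.versions[key] = [v for v in chain if v[0] <= stable_ts]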

Distributed Transactions

MongoDB 4.0 supports multi-document transactions on replica sets and plans to support transactions on sharded clusters in version 4.2. This functional iteration spans from MongoDB 3.0, which introduced WiredTiger, to 4.0, which supports multi-document transactions.

ApsaraDB for MongoDB

If you are planning to use MongoDB 4.0 for your enterprise, consider ApsaraDB for MongoDB. Alibaba Cloud ApsaraDB for MongoDB is built on the Apsara distributed file system with high-performance storage. It features a three-node replica set with a high-availability architecture, disaster recovery switching, and completely transparent failover. It also provides professional solutions such as online database resizing, backup and rollback, and performance monitoring and analysis.

Follow me to keep abreast of the latest technology news, industry insights, and developer trends.
