How to Create Highly Available MongoDB Databases with Replica Sets

Primary Election

The replica set is initialized through the replSetInitiate command (or rs.initiate() for mongo shell). After the initialization, various members start to send heartbeat messages and initialize the primary node election. The node that wins the majority of votes is elected as the primary node while the remaining ones become the secondary nodes.

config = {
_id : "my_replica_set",
members : [
{_id : 0, host : "rs1.example.net:27017"},
{_id : 1, host : "rs2.example.net:27017"},
{_id : 2, host : "rs3.example.net:27017"},
]
}

rs.initiate(config)

Definition of Majority

Suppose that the number of voting members (to be introduced later) in the replica set is N. Majority is then defined by the formula (N/2) + 1, or (N+1)/2, for even and odd values of N, respectively. If the number of members alive in the replica set is less than the majority, the replica set cannot elect the primary node, and is not able to provide write services. The replica set is in the read-only status.

Special Secondary Node

In normal cases, the secondary node of the replica set will participate in the primary node election (it may also be elected as the primary node), and synchronize the data last written from the primary node to ensure its data consistency with the primary node.

Arbiter

The arbiter node only participates in the voting. It cannot be elected as the primary node and does not synchronize data from the primary node.

Priority0

The election priority of a Priority0 node is 0 and the Priority0 node will not be selected as the primary node.

Vote0

In MongoDB 3.0, you can set a maximum number of 50 replica set members, and a maximum number of 7 members participating in the primary node election. The vote attributes of other members (Vote0) must be set to 0, that is, they do not participate in the voting.

Hidden

The hidden node cannot be selected as the primary node (its Priority is 0) and is invisible to the Driver.

Delayed

The delayed node must be a hidden node, and its data lags behind that on the primary node for some time (this is configurable, such as one hour).
Because the data on the delayed node lags behind that on the primary node, you can recover data on the primary node by using historical data on the delayed node.

Data Synchronization

The primary node and the secondary node synchronize data through oplog. After the write operation is completed on the primary node, the primary node will write an oplog to the local.oplog.rs special set and the secondary node keeps pulling and applying the oplog from the primary node.

{
"ts" : Timestamp(1446011584, 2),
"h" : NumberLong("1687359108795812092"),
"v" : 2,
"op" : "i",
"ns" : "test.nosql",
"o" : { "_id" : ObjectId("563062c0b085733f34ab4129"), "name" : "mongodb", "score" : "100" }
}
  • ts: Operation time. The current timestamp + counter, and the counter is reset every second.
  • h: The global unique identifier of the operation.
  • v: The oplog version information.
  • op: Operation type.
  • i: Insert operation.
  • u: Update operation.
  • d: Delete operation.
  • c: Execute commands (such as createDatabase, and dropDatabase)
  • n: Null operation. It is for some special purposes.
  • ns: The targeted set of the operation.
  • o: Operation content. If it is an update operation:
  • o2: The operation query condition. Only the update operation contains this field.
  1. At T1, the secondary node synchronizes the data of all the databases on the primary node (except local) through the sensitive command combination of listDatabases + listCollections + cloneCollection. We suppose all the operations are completed at T2.
  2. Apply all the oplogs from the period of [T1-T2] from the primary node. Some operations may have been included in Step 1. But because of the idempotence of the oplog, the oplog can be applied repeatedly.
  3. Create indexes for corresponding sets on the secondary node according to the indexing settings of various sets on the primary node. (The _id index of every set has been completed in Step 1.)

Modifying Replica Set Configurations

When you need to modify the replica set, such as adding a member, deleting a member, or modifying the member configuration (such as priority, vote, hidden, and delayed among other attributes), you can use the replSetReconfig command (rs.reconfig()) to re-configure the replica set.

cfg = rs.conf();
cfg.members[1].priority = 2;
rs.reconfig(cfg);

Details on Primary Node Election

Apart from at the replica set initialization, the primary node election may also occur in the following scenarios:

  • The replica set is re-configured
  • The secondary node will trigger a new round of primary node election when it detects the primary node failure.
  • When the primary node performs an active stepDown (actively downgrade to the secondary node), a new round of primary node election will be triggered.

Inter-node Heartbeat

Members in a replica set will send a heartbeat message between each other every two seconds by default. If the heartbeat message of a node is not received for 10 seconds, the node is considered to have failed. If the failed node is the primary node, the secondary node (the premise is that it can be voted as the primary node) will initiate a new round of primary node election.

Node Priority

  • Every node is inclined to vote the node with the highest priority as the primary node.
  • A node with the priority of 0 will not take the initiative to trigger the primary node election.
  • When the primary node discovers a secondary node with a higher priority, and the data latency on the secondary node is within 10 seconds, the primary node will perform an active stepDown and make the secondary node with a higher priority eligible for being the primary node.

Optime

Only the node with the latest optime (the timestamp of the most recent oplog record) can be elected as the primary node.

Network partition

A node is eligible to be elected as the primary node only if it remains connected with a majority of voting nodes. If the primary node is disconnected from a majority of nodes, the primary node will take the initiative to downgrade to a secondary node. During network partitioning, multiple primary nodes may appear within a short period of time. To avoid this from happening, you should set the majority policy when you write data to the driver. This ensures that even if multiple primary nodes appear, only one primary node can successfully write data to the majority of nodes.

Read/Write Settings of the Replica Set

Read Preference

By default, all the read requests in the replica set are sent to the primary node and the driver can route the read requests to other nodes through setting the Read Preference.

  • Primary: The default rule is that all the read requests are sent to the primary node.
  • PrimaryPreferred: The primary enjoys priority. If the primary node is unreachable, the requests are sent to the secondary nodes.
  • Secondary: All the read requests are sent to the secondary node.
  • SecondaryPreferred: The secondary node enjoys priority. When all the secondary nodes are unreachable, the requests are sent to the primary node.
  • Nearest: The read requests are sent to the nearest reachable node (detected through the ping).

Write Concern

By default, the primary node returns the data as soon as it completes the write operation. The Driver can set the write success rules through setting Write Concern.

db.products.insert(
{ item: "envelopes", qty : 100, type: "Clasp" },
{ writeConcern: { w: majority, wtimeout: 5000 } }
)
cfg = rs.conf()
cfg.settings = {}
cfg.settings.getLastErrorDefaults = { w: "majority", wtimeout: 5000 }
rs.reconfig(cfg)

Exception Handling (Rollback)

When the primary node is down and the primary node re-joins the set, if some data is not synchronized to the secondary node and there have been some write operations on the new primary node, the old primary node needs to roll back some operations to ensure the consistency of the dataset with the new primary node.

ApsaraDB for MongoDB

Alibaba Cloud ApsaraDB for MongoDB is fully compatible with the MongoDB protocol and offers a full range of database solutions for enterprises. It automatically creates a three-node MongoDB replica set for users, which encapsulates advanced functions such as DR switchover and failover, and provides complete transparency. In addition to its support for replica sets, ApsaraDB for MongoDB helps you to maintain a high-availability MongoDB database through its monitoring and alarms feature. You can also conveniently backup and recover data by leveraging its automatic backupfeature.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com