Understanding the Failover Mechanism of Redis Cluster
By Yuxun
Community Redis Cluster is a decentralized peer-to-peer cluster architecture with no central node. Internally, it uses the gossip protocol to propagate and maintain the cluster topology and the cluster metadata. See the official Redis cluster tutorial for more background.
Failover is the fault tolerance mechanism provided by Redis Cluster, and one of the cluster's core functions. Failover supports two modes:
- Failure failover: automatically restores the availability of the cluster
- Artificial failover (manual failover, triggered with the CLUSTER FAILOVER command): supports operation and maintenance of the cluster
Failure Failover
Failure failover is the process in which a slave takes over from its master after the master node fails.
Probe Stage
All of the nodes in a cluster communicate through the gossip protocol. The probe steps are:
(1) In cron, each node sends pings without traversing the whole cluster: it randomly samples five nodes and pings the one with the oldest pong_recv. It then traverses all nodes and also pings any node whose pong_recv is older than timeout/2.
(2) During the traversal, if a ping has been sent to a node and no pong has been received within the timeout, the node is set to the pfail state. In the gossip packets it then exchanges with other nodes, each node carries the nodes it has marked as pfail.
(3) When a healthy node receives a ping packet, it counts how many master nodes in the cluster have marked the failed node as pfail. If more than one-half of the masters have marked it as pfail, the node is set to the fail state. If the receiving node is a slave of the failed node, it broadcasts of its own accord that the failed node is in the fail state.
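The three probe steps can be sketched in Python as follows. This is a minimal illustration and not the Redis source: the Node class, the function names and the pfail_reports set are hypothetical, and the sketch assumes that a node's own pfail judgment is counted together with the reports gossiped by other masters.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ping_sent: float = 0.0   # time the outstanding ping was sent; 0 means none in flight
    pong_recv: float = 0.0   # time the last pong was received
    pfail: bool = False
    fail: bool = False
    # Hypothetical bookkeeping: names of masters that consider this node pfail.
    pfail_reports: set = field(default_factory=set)


def pick_ping_targets(nodes, now, timeout, sample_size=5):
    """Step (1): ping the stalest node of a random sample, then any node quiet for timeout/2."""
    targets = []
    sample = random.sample(nodes, min(sample_size, len(nodes)))
    if sample:
        targets.append(min(sample, key=lambda n: n.pong_recv))
    for n in nodes:
        already = any(t is n for t in targets)
        if not already and n.ping_sent == 0 and now - n.pong_recv > timeout / 2:
            targets.append(n)
    return targets


def mark_pfail(nodes, now, timeout):
    """Step (2): a ping left unanswered for longer than the timeout marks the node pfail."""
    for n in nodes:
        if n.ping_sent and now - n.ping_sent > timeout:
            n.pfail = True


def promote_to_fail(node, master_count):
    """Step (3): pfail becomes fail once more than half of the masters report it."""
    if len(node.pfail_reports) * 2 > master_count:
        node.fail = True
    return node.fail
```

With three masters, promote_to_fail needs two pfail reports before it sets the fail flag, which is why a single node that loses contact cannot declare a failure by itself.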
The figure below illustrates the core flow using a 3-node cluster:
Preparation Stage
In the cron function, when a slave node learns that its master is in the fail state, it initiates a failover of its own accord. The failover is not executed immediately; several constraints have been designed:
(1) A slave whose data is too old does not execute the failover. How is it judged whether the data is too old? data_age = current time - time of the last loss of contact with the master - timeout
If data_age is greater than the master-to-slave ping interval + timeout * cluster_slave_validity_factor, the data is regarded as expired. cluster_slave_validity_factor is a configuration setting, and the smaller it is set, the harder it is to trigger failover.
(2) The delayed execution time failover_auth_time is calculated as failover_auth_time = current time + 500 ms + random value of 0–500 ms + rank of the current slave * 1 s. The rank is calculated from the replication offset the slave has already synchronized: the further its offset lags behind, the greater its rank, and the longer the slave postpones triggering failover. This prevents several slaves from initiating failover at the same time. Failover is executed only once the current time reaches failover_auth_time.
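The two constraints above reduce to a little arithmetic, sketched here in Python. The function and parameter names are hypothetical, all times are in milliseconds, and slave_rank shows only one plausible way to derive a rank from offsets (counting the sibling slaves that hold a larger offset); the text above does not spell out the exact ranking rule.

```python
import random


def is_data_too_old(now_ms, lost_contact_ms, timeout_ms, ping_period_ms, validity_factor):
    """Constraint (1): compare data_age against ping interval + timeout * validity factor."""
    data_age = now_ms - lost_contact_ms - timeout_ms
    limit = ping_period_ms + timeout_ms * validity_factor
    return data_age > limit


def slave_rank(my_offset, sibling_offsets):
    """Assumed ranking rule: one rank step per sibling slave with a larger offset."""
    return sum(1 for offset in sibling_offsets if offset > my_offset)


def failover_auth_time(now_ms, rank):
    """Constraint (2): fixed 500 ms delay + random 0-500 ms + 1 s per rank step."""
    return now_ms + 500 + random.randint(0, 500) + rank * 1000


# Example: a slave at offset 100 with siblings at 150, 90 and 120 gets rank 2,
# so it waits at least 2.5 s longer than the current time before asking for votes.
rank = slave_rank(100, [150, 90, 120])
start_at = failover_auth_time(now_ms=1000, rank=rank)
```

The random component spreads out slaves that share the same rank, while the rank term gives the slave with the most complete data the earliest start.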
Execution Stage
(1) The slave increments currentEpoch and assigns the new value to failover_auth_epoch
(2) It sends a failover vote request to the other masters, and awaits the voting results
(3) When another master receives a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST request, it checks whether the request meets the following conditions:
- The epoch in the request must be >= the epochs of all master nodes in its view of the cluster
- The initiator is a slave
- The slave's master is already in the fail state
- It votes only once in the same epoch
- It votes only once within a period of cluster_node_timeout * 2
(4) The other masters respond with CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK, and the slave counts the responses as it receives them
(5) When cron determines that acknowledgments have been received from more than one-half of the masters, the slave starts executing the failover
(6) It marks its own node as master
(7) It cleans up the replication links
(8) It resets the cluster topology information
(9) It broadcasts to all of the nodes in the cluster
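The voting conditions in step (3) and the majority check in step (5) can be sketched as below. This is an illustrative Python model, not the Redis implementation: VoterState, AuthRequest and their fields are hypothetical names, and the sketch assumes the cluster_node_timeout * 2 window is tracked per failed master.

```python
from dataclasses import dataclass, field


@dataclass
class VoterState:
    last_vote_epoch: int = -1
    # master name -> epoch of that master in this node's view of the cluster
    master_epochs: dict = field(default_factory=dict)
    # failed master name -> time (ms) this node last voted for one of its slaves
    voted_time: dict = field(default_factory=dict)


@dataclass
class AuthRequest:
    epoch: int
    sender_is_slave: bool
    master_name: str
    master_failed: bool


def should_grant_vote(state, req, now_ms, node_timeout_ms):
    """Return True and record the vote only if every condition of step (3) holds."""
    if any(req.epoch < epoch for epoch in state.master_epochs.values()):
        return False                       # request epoch is older than a known master epoch
    if not req.sender_is_slave:
        return False                       # only slaves may ask for votes
    if not req.master_failed:
        return False                       # the slave's master must be in the fail state
    if state.last_vote_epoch == req.epoch:
        return False                       # already voted in this epoch
    last = state.voted_time.get(req.master_name)
    if last is not None and now_ms - last < node_timeout_ms * 2:
        return False                       # voted too recently (assumed per-master window)
    state.last_vote_epoch = req.epoch
    state.voted_time[req.master_name] = now_ms
    return True


def has_quorum(ack_count, master_count):
    """Step (5): more than one-half of the masters must have acknowledged."""
    return ack_count * 2 > master_count
```

In a cluster with three masters, has_quorum(2, 3) is true, so the slave needs acknowledgments from two masters before it promotes itself.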
The figure below illustrates the core flow using a 3-node cluster:
Artificial Failover
Artificial failover is started manually with the CLUSTER FAILOVER command and supports three modes: default, force, and takeover
Default
(1) The slave sends CLUSTERMSG_TYPE_MFSTART to the master
(2) After the master receives it, it sets clients_pause_end_time = current time + 5s * 2 and clients_paused = 1, and suspends all client requests; newly arriving requests are added to the blocked client list.
(3) The master carries its repl_offset in its ping packets
(4) The slave examines the repl_offset of the master, and confirms that synchronization has finished.
(5) The slave sets mf_can_start = 1, and the normal failover flow is started in cron. Unlike failure failover, execution is not delayed; the operation is executed immediately, and the other masters do not need to consider whether the master is in the fail state when they vote.
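The default-mode handshake can be modeled with two small classes. This is a hedged Python sketch: the Master and Slave classes and their method names are hypothetical, while clients_paused, clients_pause_end_time, repl_offset and mf_can_start mirror the fields named in the steps above. The sketch assumes the slave treats synchronization as finished when its own offset equals the offset announced by the paused master.

```python
from dataclasses import dataclass

CLIENT_PAUSE_MS = 5000 * 2  # the "5s * 2" pause from step (2)


@dataclass
class Master:
    repl_offset: int = 0
    clients_paused: int = 0
    clients_pause_end_time: int = 0

    def on_mfstart(self, now_ms):
        """Step (2): pause clients so the replication offset stops advancing."""
        self.clients_pause_end_time = now_ms + CLIENT_PAUSE_MS
        self.clients_paused = 1

    def ping_payload(self):
        """Step (3): the ping packet carries the master's repl_offset."""
        return {"repl_offset": self.repl_offset}


@dataclass
class Slave:
    repl_offset: int = 0
    mf_can_start: int = 0

    def on_master_ping(self, payload):
        """Steps (4)-(5): allow the failover once the slave has caught up."""
        if payload["repl_offset"] == self.repl_offset:
            self.mf_can_start = 1


master, slave = Master(repl_offset=100), Slave(repl_offset=90)
master.on_mfstart(now_ms=1000)
slave.on_master_ping(master.ping_payload())   # still behind, mf_can_start stays 0
slave.repl_offset = 100                       # replication stream fully applied
slave.on_master_ping(master.ping_payload())   # offsets match, mf_can_start becomes 1
```

Pausing clients first is what makes the offset comparison meaningful: with writes suspended, the master's repl_offset is a fixed target for the slave to reach.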
Force
The state of master-slave synchronization is ignored: mf_can_start = 1 is set directly, and the failover is started.
Takeover
Steps (6)–(9) of the failure failover execution stage are executed directly; master-slave synchronization is ignored, and no vote from the other masters of the cluster is required.
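The difference between the three modes comes down to which phases are skipped. The Python sketch below summarizes that; the Mode enum and the failover_plan function are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Mode(Enum):
    DEFAULT = "default"
    FORCE = "force"
    TAKEOVER = "takeover"


def failover_plan(mode):
    """Return which phases of the failover each mode still performs."""
    return {
        # Only the default mode waits for the slave to catch up with the master.
        "wait_for_sync": mode is Mode.DEFAULT,
        # Takeover promotes the slave without asking the other masters to vote.
        "request_votes": mode is not Mode.TAKEOVER,
        # Every mode ends with the promotion steps (6)-(9).
        "promote": True,
    }
```

For example, failover_plan(Mode.FORCE) skips the synchronization wait but still requests votes, whereas failover_plan(Mode.TAKEOVER) skips both.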
Conclusion
ApsaraDB for Redis is a stable, reliable and scalable database service with superb performance. It is structured on Apsara Distributed File System and full SSD high-performance storage, and supports master-slave and cluster-based high-availability architectures. It offers a full range of database solutions including disaster recovery switchover, failover, online expansion, and performance optimization. We welcome everyone to purchase and use ApsaraDB for Redis.