A Brief Analysis of Consensus Protocol: From Logical Clock to Raft

Logical Clock

A logical clock is not actually a consensus protocol. It is an idea put forward by Lamport in 1987 to solve possible problems caused by clock inconsistency between different machines in a distributed system. In a standalone system, machine time is used to identify events so that we can clearly identify the order of occurrence for two different events. However, in a distributed system, because time deviation may vary from machine to machine, event order cannot be decided through physical clocks. In fact, in a distributed system, we only pay attention to the order of two associated events. Consider two transactions: One is the modification of row A and the other is the modification of row B. We actually don’t care which of the two transactions occurs first. The so-called logical clock is what is used to define the order of occurrence for two associated events, that is, “happens before”. A logical clock cannot determine the order of occurrence for unassociated events. So, this “happens before” relation is actually a partial ordering relation.

Replicated State Machine

When it comes to consensus protocols, we usually talk about the state machine replication. Usually state machine replication and consensus protocol algorithms are used together to implement high availability and fault tolerance in distributed systems. Many distributed systems use state machine replication to synchronize data between copies, such as HDFS, Chubby, and Zookeeper.

Paxos

Paxos is a consensus protocol algorithm developed by Lamport in the 1990s. It was widely found hard to understand. Therefore, Lamport published a new paper “Paxos Made Simple” in 2001, where he said that Paxos is the simplest consensus algorithm in the world and very easy to understand. However, it is still generally considered hard to understand in this industry. After reading Lamport’s papers, I think that the Paxos protocol itself is not hard to understand apart from the complex process of argument for correctness. However, the Paxos protocol is too theoretical and far from being applied in specific engineering practices. I was also very confused when I first learned about the Paxos protocol. I read the papers many times and found that this protocol is just for single-event consensus and that the agreed value cannot be modified. How can we use Paxos to implement state machine replication? In addition, only Propose and Follower know the agreed values based on the Paxos protocol. How can we actually use this protocol? However, the Paxos protocol is a lot easier to understand if you only consider this protocol theoretically and do not consider problems that may occur in actual engineering. In Lamport’s papers, the application of state machines is just a general idea and no specific implementation logic is included. It is impossible to directly use Paxos for state machine replication. Instead, we need to add many things to Paxos. That is why Paxos has so many variants.

Basic Paxos

Basic Paxos is the Paxos algorithm first put forward by Lamport. In fact, it is simple and can be explained in just a few words. Next, I will describe Paxos in my own words and give an example. To understand Paxos, simply remember one thing: Paxos can only enable consensus for one value and the proposal cannot be changed once decided. That is to say, the entire Paxos Group only accepts one proposal (or several proposals with different values). As to how to accept multiple values to implement state machine replication, see Multi Paxos in the next section.

Multi-Paxos

As mentioned previously, Paxos is in the theoretical phase and cannot be directly used for state machine replication. The reasons are as follows:

  • Paxos can only determine one value and cannot be used for continuous log replication.
  • The presence of multiple Proposers may lead to livelock. In the previous example, Server2 submits a proposal twice before a proposal is finally accepted. In some extreme scenarios, more submittals of proposals may be required.
  • The final result of a proposal is only known to partial Acceptors. This cannot guarantee that each instance for state machine replication has a completely consistent log.

ZAB

ZAB (ZooKeeper Atomic BoardCast) is a consensus protocol used in ZooKeeper. ZAB is a dedicated protocol for Zookeeper. It is strongly bound to Zookeeper and has not been extracted into an independent database. Therefore, ZAB is not widely used and only limited to Zookeeper. However, the papers on ZAB protocol thoroughly prove that ZAB has the ability to meet the strong consistency requirement.

Raft

Raft is a new consensus protocol developed by developers at Stanford University in 2014. The developers developed this new consensus protocol because they considered Paxos difficult to understand. In addition, Paxos is only a theory and far from being applied in actual engineering. The developers of Paxos listed some disadvantages of Paxos:

  1. The Paxos protocol does not require a leader. Each Proposer can create a proposal. Leader selection and consensus agreement are separated at the very beginning of designing Raft, while leader selection and proposal are mixed together in Paxos, making Paxos hard to understand.
  2. The original Paxos protocol is only to reach consensus on one single event. Once a value is determined, it cannot be modified. However, in realistic scenarios (including database consistency), it is required to continuously reach consensus on the value of a log entry. Therefore, the Paxos protocol itself cannot meet the requirement: We need to make some improvements and supplements to the Paxos protocol to apply Paxos in engineering in a real sense. Making supplements to the Paxos protocol is very complex. Although the Paxos protocol has been proven by Lamport, the Paxos-based and improved algorithms like Multi-Paxos are unproven.
  3. Another disadvantage is that Paxos only provides a rough description. This requires that subsequent improvements on Paxos and projects that use Paxos like Google Chubby have to implement a set of projects to solve specific problems in Paxos. The implementation details of projects like Chubby are not made public. That is to say, to apply Paxos in your own projects, basically you have to customize and implement a set of Paxos protocols that meet your specific requirements.

Closing Words

Currently, the improved Paxos protocol has been used in many distributed products, such as Chubby, PaxosStore, Alibaba Cloud X-DB, and Ant Financial OceanBase. It is generally believed that the Raft protocol has lower performance than Paxos because it only allows committing entries in sequence. However, TiKV that uses Raft officially declares that it has made many optimizations on Raft and has significantly improved the performance of Raft. POLARDB is another Alibaba Cloud database that also uses Parallel-Raft (the improved version of Raft) to implement the parallel commit capability in Raft. I believe that more Paxos/Raft-based products will be available in the future and that more improvements will be made to Raft/Paxos.

References

  1. Time, clocks, and the ordering of events in a distributed system
  2. Implementing fault-tolerant services using the state machine approach- A tutorial
  3. Paxos Made Simple
  4. Paxos made live- An engineering perspective
  5. Multi-Paxos (one PPT presentation at Standford University)
  6. Zab- High-performance broadcast for primary-backup systems
  7. In search of an understandable consensus algorithm (Raft)

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com