Raft Consensus Protocol

LiveRunGrow
12 min read · Nov 16, 2020

This article is a summary of the original Raft paper, linked below. For now it does not cover the sections on cluster membership changes and log compaction (future edits to this article may add them).

Some figures below are also taken from my school’s lecture notes.

Link to the Raft Research paper: https://raft.github.io/raft.pdf

Introduction

  • A consensus algorithm is one that allows a cluster of machines to work together as a coherent group that can survive the failure of some of its members.
  • It is a fundamental, general-purpose abstraction in distributed systems: it guarantees that all participating servers agree on something, as long as a certain number of failures is not exceeded. The goal of consensus is not to negotiate an optimal value of some kind, but simply to reach collective agreement on some value that was previously proposed by one of the participating servers in that round of the consensus algorithm. With consensus, a distributed system can act as though it were a single entity.
  • Raft is a consensus algorithm for managing a replicated log. There are many other consensus algorithms such as Paxos.

Key Features of Raft

  • Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to the other servers.
  • Leader Election: Randomised timers are used to elect leaders.
  • Membership changes: not covered in this article; the paper describes a joint consensus approach for changing the set of servers in the cluster.

Replicated State Machines (RSM)

Now, let's understand what each server in a Raft cluster contains.

  • Consensus algorithms typically arise in the context of replicated state machines.
  • State machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. They are used to solve fault-tolerance problems in distributed systems. For example, large-scale systems with a single cluster leader, such as GFS and HDFS, typically use a separate replicated state machine to manage leader election and to store configuration information that must survive leader crashes. ZooKeeper is an example of a replicated state machine.
  • Replicated state machines are typically implemented using a replicated log. Each server holds a log containing a series of commands, which its state machine executes in order; the sequence is the same on every server (a minimal sketch follows this list).
  • Keeping the replicated log consistent is the job of the consensus algorithm.
  • The consensus module on a server receives commands from clients and adds them to its log. It communicates with other consensus modules on other servers.
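
To make the replicated state machine idea concrete, here is a minimal sketch in Go with a toy key/value service. The types and names (Command, StateMachine, replay) are hypothetical illustrations, not code from the paper: every server executes the same log of commands, in the same order, against its local state machine.

    package raftsketch

    // Command is a hypothetical client command for a toy key/value service.
    type Command struct {
        Key   string
        Value string
    }

    // StateMachine is the deterministic service being replicated.
    type StateMachine struct {
        data map[string]string
    }

    // Apply executes one command. Because every server applies the same log in
    // the same order, every copy of the state machine ends up in the same state.
    func (sm *StateMachine) Apply(cmd Command) {
        sm.data[cmd.Key] = cmd.Value
    }

    // replay rebuilds a state machine by executing the agreed-upon log in order,
    // which is exactly what a server does when it recovers or starts fresh.
    func replay(log []Command) *StateMachine {
        sm := &StateMachine{data: make(map[string]string)}
        for _, cmd := range log {
            sm.Apply(cmd)
        }
        return sm
    }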

(1) Raft Basics

  • Every server in a Raft cluster is in one of three states: leader, follower, or candidate.
  • In normal operation, there is only 1 leader and the rest are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader).
  • A candidate is simply a server that is a potential leader: a state in between follower and leader. When the leader dies, any follower whose election timeout expires becomes a candidate and starts an election. A successful candidate becomes the leader, and all unsuccessful candidates revert to the follower state.
  • Term: Raft divides time into terms, which are numbered with consecutive integers. Every term begins with an election.
  • Terms act as a logical clock in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
  • Servers communicate using remote procedure calls (RPCs). There are two types: RequestVote RPCs are initiated by candidates during elections, and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat.
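
To make the state and term bookkeeping concrete, here is a minimal Go sketch (the field and function names are mine, not the paper's): a server remembers its role, current term and vote; it adopts any newer term it sees and steps down to follower, and it rejects any request carrying a stale term.

    package raftsketch

    // Role is one of the three states a Raft server can be in.
    type Role int

    const (
        Follower Role = iota
        Candidate
        Leader
    )

    // Server keeps the per-server bookkeeping described above.
    type Server struct {
        id          int // this server's identifier
        role        Role
        currentTerm int
        votedFor    int // candidate voted for in currentTerm (-1 = none)
    }

    // observeTerm applies the term rules to the term carried by an incoming RPC:
    // a stale term is rejected; a newer term is adopted, the vote is cleared,
    // and a leader or candidate immediately reverts to follower.
    func (s *Server) observeTerm(rpcTerm int) (accept bool) {
        if rpcTerm < s.currentTerm {
            return false // sender is stale: reject the request
        }
        if rpcTerm > s.currentTerm {
            s.currentTerm = rpcTerm
            s.votedFor = -1
            s.role = Follower
        }
        return true
    }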

(2) Leader Election

  • Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.
  • Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to maintain their authority. If a follower receives no communication over a period of time (the election timeout), it assumes there is no viable leader and starts an election.
  • To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner, in which case each candidate times out and begins a new election with a new term number (incremented by one). A rough sketch of the candidate side follows this list.
  • Whether a server grants its vote depends on whether the RPC's term is at least as large as its own and whether the candidate's log is at least as complete as its own (the exact rule is covered in the Safety section below).
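
Continuing the Go sketch above, the candidate side looks roughly like this (the RPC plumbing is elided and only described in comments; all names are illustrative):

    package raftsketch

    import (
        "math/rand"
        "time"
    )

    // electionTimeout returns a randomized timeout (the paper uses values on the
    // order of 150-300 ms); randomization makes split votes rare and short-lived.
    func electionTimeout() time.Duration {
        return time.Duration(150+rand.Intn(150)) * time.Millisecond
    }

    // startElection shows the candidate-side steps taken when the timeout fires.
    func (s *Server) startElection(peers []int) {
        s.currentTerm++    // new term for this election attempt
        s.role = Candidate // follower -> candidate
        s.votedFor = s.id  // vote for ourselves
        votesNeeded := (len(peers)+1)/2 + 1 // majority of the cluster, counting ourselves
        _ = votesNeeded
        // Send RequestVote(term, candidateId, lastLogIndex, lastLogTerm) to every
        // peer in parallel. Become leader on a majority of votes, step down if a
        // valid leader's AppendEntries arrives, or time out and retry with a
        // higher term.
    }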

(3) Log Replication

  • Once a leader has been elected, it begins receiving and processing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry (a leader-side sketch follows this list).
  • When the entry has been safely replicated, the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
  • The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6 of the paper). This also commits all preceding entries in the leader’s log, including entries created by previous leaders.
  • One important thing to note is that a log entry is only directly committed when it is replicated to a majority of servers by the leader of the term in which it was created. Otherwise, it can only be indirectly committed: in a later term, a later entry is directly committed, which commits all preceding entries with it. Leaders from later terms can help replicate such an entry, but it is not counted as committed until a later entry from the new leader's own term is committed after it.
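
Sketching the leader side of this flow in the same Go style (the AppendEntries fan-out and commit bookkeeping are only described in comments; LeaderState and its fields are my own names, and Command is reused from the state-machine sketch earlier):

    package raftsketch

    // LogEntry pairs a client command with the term in which the leader received it.
    type LogEntry struct {
        Term    int
        Command Command
    }

    // LeaderState is the leader-side view of replication used in these sketches.
    type LeaderState struct {
        currentTerm int
        log         []LogEntry
        commitIndex int // highest log index known to be committed
    }

    // handleClientCommand shows the leader's steps for one client request:
    // append the command locally, replicate it, and apply it once committed.
    func (l *LeaderState) handleClientCommand(cmd Command) {
        l.log = append(l.log, LogEntry{Term: l.currentTerm, Command: cmd})
        // Issue AppendEntries RPCs to every follower in parallel, retrying
        // indefinitely for slow or crashed followers. Once a majority of
        // servers store the entry it is committed: advance commitIndex,
        // apply the entry to the state machine, and reply to the client.
    }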

Log Matching Property One: If two entries in different logs have the same index and term, then they store the same command. This follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log.

Log Matching Property Two: If two entries in different logs have the same index and term, then the logs are identical in all preceding entries. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
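
The consistency check itself is small. Here is a follower-side sketch (my own code, reusing the LogEntry type above; log indices are treated as 1-based with a dummy entry in slot 0, as in the paper):

    package raftsketch

    // AppendEntriesArgs carries only the fields relevant to the consistency check.
    type AppendEntriesArgs struct {
        Term         int        // leader's term
        PrevLogIndex int        // index of the entry immediately preceding the new ones
        PrevLogTerm  int        // term of that preceding entry
        Entries      []LogEntry // new entries to store (empty for heartbeats)
    }

    // consistencyCheck returns true only if the follower's log contains an entry
    // at PrevLogIndex whose term matches PrevLogTerm; otherwise the follower
    // refuses the new entries and the leader must retry with an earlier index.
    func consistencyCheck(log []LogEntry, args AppendEntriesArgs) bool {
        if args.PrevLogIndex >= len(log) {
            return false // we are missing entries the leader assumed we have
        }
        return log[args.PrevLogIndex].Term == args.PrevLogTerm
    }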

Important!

  • During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent: the old leader may not have fully replicated all of the entries in its log, or a network partition may leave a leader in each partition, each with its own set of followers. A follower may be missing entries that are present on the leader, or it may have extra entries that are not present on the leader.
  • The leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log.
  • The leader maintains a nextIndex for each follower. This is the index of the next log entry the leader will send to that follower. When a leader is first elected, it initializes every nextIndex to the index just after the last entry in its own log (so this value is the same for all followers at the start of the term).
  • To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs.
  • If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term. A rough sketch of this retry loop follows this list.
  • In the worst case, the follower’s entire log conflicts with the leader’s, and all of its entries are overwritten by the leader’s.
  • A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property).
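
Continuing the Go sketches above, the retry loop looks roughly like this. In a real implementation the accept/reject answer comes back in the AppendEntries reply; here the follower's log is passed in directly just to make the check visible.

    package raftsketch

    // syncFollower repairs one follower's log: start from an optimistic nextIndex,
    // back up one entry on each rejection, and stop once the consistency check
    // passes. Index 0 holds the dummy entry, so the two logs always agree there.
    func (l *LeaderState) syncFollower(followerLog []LogEntry) {
        nextIndex := len(l.log) // optimistic: assume the follower already has everything
        for nextIndex > 0 {
            prev := nextIndex - 1
            // Send AppendEntries with prevLogIndex=prev, prevLogTerm=l.log[prev].Term
            // and entries=l.log[nextIndex:] to the follower.
            if prev < len(followerLog) && followerLog[prev].Term == l.log[prev].Term {
                // Consistency check passed: the follower deletes any conflicting
                // entries after prev and appends the leader's entries. Done.
                return
            }
            nextIndex-- // rejected: back up one entry and retry
        }
    }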

(4) Safety

  • However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries; that follower could later be elected leader and overwrite those entries with new ones. As a result, different state machines might execute different command sequences: the new leader would accept new writes and force the leader of the previous term to erase its (committed) log entries.
  • To prevent the situation from happening, the Raft algorithm adds restrictions on who could be elected as a leader to ensure that the leader for any given term contains all of the entries committed in previous terms (Leader completeness property). It guarantees that all committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader.
  • A candidate must contact a majority of the cluster in order to be elected, which means every committed entry must be present on at least one of those servers. A voter grants its vote only if the candidate’s log is at least as up-to-date as its own, so a candidate whose log is missing committed entries cannot gather votes from a majority.
  • The RequestVote RPC implements this restriction: it includes the index and term of the candidate’s last log entry. The voter compares this with its own last log entry and grants its vote only if the candidate’s log is at least as up-to-date as its own.
  • If the last entries of the two logs have different terms, the log whose last entry has the later term is the more up-to-date. If the terms are the same, the longer log is the more up-to-date (a small comparison sketch follows this list).
  • As mentioned, an entry from a leader is committed once that entry is stored on a majority of servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. This scenario is illustrated in Figure 8 of the paper.
  • To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas (an entry that was not committed during its own term is never directly committed in a later term). Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property.
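
A minimal sketch of the voting rule, continuing the Server sketch above (the helpers and their names are mine; the candidate's last log index and term arrive in the RequestVote RPC, and the voter's own last index and term are passed in as parameters just to keep the sketch small):

    package raftsketch

    // candidateLogIsUpToDate implements the election restriction: the log whose
    // last entry has the later term wins; with equal last terms, the longer log wins.
    func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }

    // grantVote combines the checks a voter makes: the candidate's term must not
    // be stale, the voter must not have voted for someone else this term, and the
    // candidate's log must be at least as up-to-date as the voter's.
    func (s *Server) grantVote(candID, candTerm, candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
        if !s.observeTerm(candTerm) {
            return false // stale term
        }
        if s.votedFor != -1 && s.votedFor != candID {
            return false // already voted for a different candidate this term
        }
        if !candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex) {
            return false
        }
        s.votedFor = candID
        return true
    }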

Summary

Extra Notes:

a)

One common source of confusion is the difference between nextIndex and matchIndex. In particular, you may observe that matchIndex = nextIndex - 1. While nextIndex and matchIndex are generally updated at the same time to a similar value (specifically, nextIndex = matchIndex + 1), the two serve quite different purposes. nextIndex is a guess as to what prefix the leader shares with a given follower. It is generally quite optimistic (we share everything), and is moved backwards only on negative responses. For example, when a leader has just been elected, nextIndex is set to the index just past the end of the leader's log. In a way, nextIndex is used for performance: you only need to send entries from nextIndex onward to this peer. matchIndex, on the other hand, is used for safety: it is a conservative measurement of the prefix of the log the leader definitely shares with a given follower, and it is advanced only when a follower positively acknowledges entries.

Even when an entry is present in a majority of the logs, the leader might not be certain that it can safely commit and apply it; this is the Figure 8 scenario again, so only entries from the leader's current term may be committed by counting replicas.
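
Here is a sketch of how a leader can advance commitIndex from matchIndex while respecting that rule, continuing the LeaderState sketch above (the leader counts its own last index alongside the followers' matchIndex values):

    package raftsketch

    import "sort"

    // advanceCommitIndex finds the highest log index stored on a majority of
    // servers and commits it only if that entry belongs to the current term;
    // older entries then commit indirectly via the Log Matching Property.
    func (l *LeaderState) advanceCommitIndex(matchIndex []int) {
        replicated := append([]int{len(l.log) - 1}, matchIndex...) // leader's own last index first
        sort.Sort(sort.Reverse(sort.IntSlice(replicated)))
        majorityIndex := replicated[len(replicated)/2] // highest index a majority has stored
        if majorityIndex > l.commitIndex && l.log[majorityIndex].Term == l.currentTerm {
            l.commitIndex = majorityIndex
        }
    }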

b)

Why are commitIndex and lastApplied volatile?

commitIndex is volatile because Raft can figure out a correct value for it after a reboot using just the persistent state. Once a leader successfully gets a new log entry committed, it knows everything before that point is also committed. A follower that crashes and comes back up will be told the right commitIndex whenever the current leader sends it an AppendEntries RPC.

lastApplied starts at zero after a reboot because the basic Raft algorithm assumes the service (e.g., a key/value database) doesn’t keep any persistent state. Its state therefore needs to be completely recreated by replaying all log entries. Hence, it is possible for the same log entry to be applied more than once (once before a server crash and again after the crash, when the log is replayed).
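
Summarizing the split in the same Go style (the grouping into two structs is my own; the fields mirror the state listed in Figure 2 of the paper):

    package raftsketch

    // PersistentState must be written to stable storage before responding to RPCs,
    // because it is all that Raft relies on after a reboot.
    type PersistentState struct {
        CurrentTerm int        // latest term this server has seen
        VotedFor    int        // candidate voted for in CurrentTerm (-1 if none)
        Log         []LogEntry // the log entries themselves
    }

    // VolatileState can safely restart from zero after a reboot: commitIndex is
    // re-learned from the leader's AppendEntries RPCs, and lastApplied restarts
    // at zero because the state machine is rebuilt by replaying the whole log.
    type VolatileState struct {
        CommitIndex int // highest log index known to be committed
        LastApplied int // highest log index applied to the state machine
    }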

The End :)
