Search code examples
replicationdistributed-computingdistributedconsensusraft

How does Raft guarantee that a leader can always be elected?


The Raft paper says:

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority, then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate

How does it guarantee, however, that there will always even be an electable leader (i.e. one that's as up-to-date as a majority of cluster)?

For example, let's say we have a cluster of three servers A, B, C with A as the leader. First log entry is stored in A and B, second log entry is stored in A and C. Then A crashes, and B and C try to elect a leader. But at this point there is no majority (i.e. 2 out of 3) servers that have both first and second entries. So it seems like leader election can't ever happen (unless A restarts, but Raft's supposed to be resilient to a failure of 1 out of 3 servers..)


Solution

  • The paper defines a "Log Matching Property" that is relevant to this scenario:

    • [..]
    • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

    Since A and C both contain the same second entry, C must also contain the first entry. This is ensured because:

    The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries.

    Until C has the entry that B has, it will reject further appends. So at some point in your scenario, C must have received that entry to finally accept the newer entry from A.

    Therefore, C is the most up to date between B and C. (It would reject a leadership vote from B.)