Search code examples
consensusraft

Is there a race condition in RAFT?


When the leader get a log entry, it replicates it to the other servers in the cluster. Then it commits the entry and tells the other servers to commit as well. There seems to be two cases here:
1) The leader commits the entry and then tells the other servers to commit as well.
2) The leader tells everyone to commit and then it does also.

In #1, if the leader crashes before telling the others to commit, then does whoever becomes the new leader use the entry even though it is not committed? If not, then we have some logs which are not in sync for the latest entry. (The old leader would have applied it and the other would not have.) If so, then how does it know that it is ok to commit it?

In #2, if the leader crashes before it can commit, then all of the other nodes crash after they commit and then in the election, the old leader becomes the new leader again, then the other servers have commited entries that the new leader doesn't have. What happens in this case?


Solution

  • It's important to note the difference between an entry being stored on a server, the entry being committed, and the entry being applied. Commitment is practically a theoretical concept. In most cases, servers don't do anything to commit an entry. It's committed by the fact that it's stored on a majority of servers and therefore is guaranteed not to be lost. Entries may be applied when they're committed or at some later time as long as servers apply them in order.

    Because of the nature of distributed systems, it's impossible for all of the servers to commit an entry at the same time. Instead, Raft guarantees only that an entry is persisted on a majority of servers by the time the leader applies it to its state machine. Most Raft implementations take approach #1 in order to allow the leader to apply the command to its state machine and respond to the client before other servers have to apply it to their state machines.

    What will happen if a leader applies a command and then fails is this:

    * We know that the command has been stored on a majority of servers therefore...
    * Raft's election algorithm guarantees that the next server that's elected has that entry
    * When the next leader is elected, it will append a no-op entry to its log and commit it
    * Once the no-op entry is committed, the leader will increase its commitIndex to that of the no-op entry and thereby commit all entries remaining from the previous term (including the original leader's last commit)
    * On the next heartbeat, the leader will send the index of the no-op as the `commitIndex`
    * Remaining followers will be replicated entries up to the leader's no-op and commit entries from the previous leader's term
    

    Does that make sense?

    So, what's important to note is that even if a leader doesn't have a chance to inform followers that an entry has been committed, Raft guarantees that the next leader will have the first leader's committed entries, and that leader will eventually replicate those entries to followers that don't already have them and the commit index will continue beyond the previous leader's last index.

    References: See section 5.4.2 of the Raft paper (https://ramcloud.atlassian.net/wiki/download/attachments/6586375/raft.pdf) for information on committing entries from prior terms