Search code examples
raft

Lecture 6: Fault Tolerance: Raft (1) MIT 6.824: Distributed Systems


I am watching Lecture 6: Fault Tolerance: Raft (1) MIT 6.824: Distributed Systems. At 1:16:25 Prof is discussing scenarios which are possible. Video Link: [https://youtu.be/64Zp3tzNbpE?t=4585] According to him, the following scenario looks possible.

    10   11   12   13   14   15
S1: 3
S2: 3    3    4
S3: 3    3    5

Explanation:

  1. At 10, all servers were working in term 3.
  2. Then at 11, S1 crashes, still S2 & S3 are in majority, so log gets replicated for 11
  3. Then at 12, S3 crashes, S2 becomes leader with new term 4, writes in the log, but before replication it crashes.
  4. Next S3 becomes leader again for term 5 and writes to its log.

I believe this is not possible, because S3's last known term is 3, so it cannot jump to term 5 for new election, because election at term 4 only S2 knows about it.

Trying to understand how this scenarios is possible.


Solution

  • Around 1:15:50 there is a tiny explanation.

    We are in a good state at log position 11 :

    • S1 is offline
    • S3 is the leader and accepted a command from a client
    • S2 is a follower
    • Both S3 and S2 know the term is 3

    Now S3 goes offline just for a fraction of time, so S2 does not hear a heartbeat and S2 initiates an election for term 4.

    S3 comes back online right before RequestVote arrives, and S3 votes YES for S2 being a leader for term 4. This is the point when S3 learns the current term is 4, I think this is the core of your question.

    At this point we are still at log position 11, since no commands from a client arrived.

    State of the system:

    • S1 is offline
    • S2 is the leader with term 4 with latest log entry at 11 (term 3)
    • S3 is the follower, who voted for S2 at term 4; hence S3 knows about term being 4; with latest log entry at 11 (term 3)

    Now this is the sequence of next events:

    • a client send a command to S2
    • S2 is a leader for term 4, so S2 writes an entry to log position 12 (term 4)
    • right before S2 sends a message to S3 (and S1), S2 goes offline (or disconnects from the network)
    • S2 being offline triggers an election for S3 with term 5 (as S3 was aware the last term was 4)
    • S1 (NOT S2) comes back online right before RequestVote arrives, S1 votes for S3 for term 5 and becomes a follower
    • a command arrives from a client to S3, S3 is the leader for term 5, S3 writes term5 entry to log position 12 and S3 crashes.

    The last sequence of events leads to the log config described in the question.