
Raft consensus protocol - Should entries be durable before committing?


I have the following question about implementing Raft:

Consider the following scenario/implementation:

  1. The Raft leader receives a command entry and appends it to an in-memory array. It then sends the entry to the followers (with the heartbeat).
  2. The followers receive the entry, append it to their in-memory arrays, and respond that they have received it.
  3. The leader then commits the entry by writing it to a durable store (a file). The leader sends the latest commit index in the heartbeat.
  4. The followers then commit the entry, based on the leader's commit index, by writing it to their durable stores (files).
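The four steps above can be sketched in a few lines (the class and method names here are hypothetical, just for illustration; this is not code from the linked implementation):

```python
# Sketch of the scheme described above: entries live only in memory
# until commit time, at which point each node writes the committed
# entries to its durable store. (Hypothetical names, simplified.)

class Node:
    def __init__(self):
        self.log = []            # in-memory entries (volatile)
        self.durable = []        # stands in for the on-disk store
        self.commit_index = -1

    def append(self, entry):
        self.log.append(entry)

    def commit_up_to(self, index):
        # Entries are persisted only when committed -- this is the
        # step the answers below identify as unsafe.
        while self.commit_index < index:
            self.commit_index += 1
            self.durable.append(self.log[self.commit_index])

leader, follower = Node(), Node()
leader.append("set x=1")                    # step 1: leader appends in memory
follower.append("set x=1")                  # step 2: follower appends in memory, acks
leader.commit_up_to(0)                      # step 3: leader commits, persists to file
follower.commit_up_to(leader.commit_index)  # step 4: follower persists committed entry
print(leader.durable, follower.durable)     # ['set x=1'] ['set x=1']
```

Note that between steps 2 and 3, the acknowledged entry exists only in volatile memory on every node.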

One implementation of Raft (link: https://github.com/peterbourgon/raft/) seems to work this way. I wanted to confirm whether this is fine.

Is it OK if entries are kept "in memory" by the leader and the followers until they are committed? In what circumstances might this scheme fail?


Solution

  • I found the answer by posting to the raft-dev Google group. I have added it here for reference.

    Please reference: https://groups.google.com/forum/#!msg/raft-dev/_lav2NeiypQ/1QbUB52fkggJ

    Quoting Diego's answer:

    For safety even in the face of correlated power outages, a majority of servers needs to have persisted the log entry before its effects are externalized. Any less than a majority, and those servers could permanently fail, resulting in data loss/corruption.

    Quoting from Ben Johnson's answer to my email regarding the same:

    No, a server has to flush entries to disk before being considered part of the quorum.
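The rule Ben states here (persist before acknowledging) can be sketched as follows. The handler name and response shape are hypothetical; the essential point is that the fsync happens before the acknowledgement is sent:

```python
import os
import tempfile

def handle_append_entries(log_path, entries):
    # Sketch of a follower-side AppendEntries handler: append the
    # entries and flush them to disk BEFORE acknowledging, so that
    # an acknowledged entry survives a crash or power failure.
    # (Hypothetical structure, not from any particular implementation.)
    with open(log_path, "a") as f:
        for entry in entries:
            f.write(entry + "\n")
        f.flush()
        os.fsync(f.fileno())       # durable before the response goes out
    return {"success": True}       # only now may this ack count toward quorum

log_path = os.path.join(tempfile.gettempdir(), "raft_log_example")
resp = handle_append_entries(log_path, ["set x=1"])
print(resp["success"])  # True
```

A follower that responds before the fsync completes is promising durability it cannot keep, which is exactly what the example below exploits.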

    For example, let's say you have a cluster of nodes called A, B, & C where A is the leader.

    1. Node A replicates an entry to Node B.

    2. Node B stores entry in memory and responds to Node A.

    3. Node A now has a quorum and commits the entry.

    4. Node A then gets partitioned away from Node B & C.

    5. Node B then dies and loses the in-memory copy of the entry.

    6. Node B comes back up.

    7. When Node B & C then go to elect a leader, the "committed" entry will not be in their log.

    8. When Node A rejoins the cluster, it will have an inconsistent log. The entry will have been committed and applied to the state machine so it can't be rolled back.

    Ben
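The eight steps in Ben's example can be simulated directly, showing the "committed" entry vanishing from the surviving cluster (an illustrative sketch with dicts standing in for nodes, not real Raft code):

```python
# Simulation of the failure scenario above, where logs are held only
# in memory and entries are persisted only once "committed".

memory = {"A": [], "B": [], "C": []}   # volatile logs
disk   = {"A": [], "B": [], "C": []}   # durable stores

memory["A"].append("entry1")           # step 1: A replicates entry to B
memory["B"].append("entry1")           # step 2: B stores it in memory, acks
disk["A"].append("entry1")             # step 3: A sees 2/3 quorum, commits
# step 4: A is partitioned away before B persists anything
memory["B"].clear()                    # step 5: B dies, losing its in-memory copy
memory["B"] = list(disk["B"])          # step 6: B restarts from its (empty) disk
# step 7: B and C elect a leader; neither log contains the committed entry
assert "entry1" not in memory["B"] and "entry1" not in memory["C"]
# step 8: A's entry was committed and applied, so it cannot be rolled back
print(disk["A"])   # ['entry1'] -- committed on A, gone from the rest of the cluster
```

Had B fsynced the entry before acknowledging in step 2, step 6 would have restored it from disk and the new leader's log would still contain it.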