Search code examples
etcdconsensusraft

What if log replication out-of-order of etcd raft?


I'm the newbie in etcd and have some confusion points about log replication:

  1. For example, leader send out {term:2,index:3} and then {term:2,index:4}, the majority respond in order too. But due to network delay, leader receive the responses out of order, receive response of {term:2,index:4} first.

sequence diagram

How etcd handle such case? It seems like just ignore the log {term:2,index:3}, commit {term:2,index:4} directly.

func (pr *Progress) MaybeUpdate(n uint64) bool {
    var updated bool
    if pr.Match < n {
        pr.Match = n
        updated = true
        pr.ProbeAcked()
    }
    pr.Next = max(pr.Next, n+1)
    return updated
}
  1. How etcd retry when response packet(e.g. resp of {term:2,index:3}) loss happen? I can't find any code snippet to handle this in the etcd project.

Solution

  • Questions you asked are more raft than etcd related (etcd implements raft, so they are still relevant tho). To get high level understanding of raft algorithm I highly recommend you to to check out raft webpage and raft paper (it's really nicely written!). I believe that section 5.3 "Log replication" would be helpful.

    First let's put some foundation: Leader keeps track of matching entries with every follower. It keeps information in nextIndex[] and matchIndex[] in the paper (check Fig. 2) and in ProgressMap in etcd.

    // ProgressMap is a map of *Progress.
    type ProgressMap map[uint64]*Progress
    
    type Progress struct {
        Match, Next uint64
        ...
    }
    

    Now let's jump to your questions.

    1. For example, leader send out {term:2,index:3} and then {term:2,index:4}, the majority respond in order too. But due to network delay, leader receive the responses out of order, receive response of {term:2,index:4} first. How etcd handle such case? It seems like just ignore the log {term:2,index:3}, commit {term:2,index:4} directly.

    Here all depends on state of the follower (from leader perspective). Let's dive into StateProbe and StateReplicate.

    In StateProbe leader tries to figure out which entries it should send to the follower. It sends one message at the time and waits for response (which might be reject response in which case leader have to decrease Next related to this follower and retry). In this state sending 2 different MsgApp to the same follower is not possible.

    In StateReplicate leader assumes that network is stable and sends (potentially) many MsgApp messages. Let's work on example.

    Match := 2, Next := 2

    Follower log : [E1, E2] (E stands for "entry")

    Leader log: [E1, E2]

    In this state leader gets put request for entries E3, E4 and E5. Let's assume that max batch size is 2 and thus all new entries can't be send in single message. Leader will send 2 messages: (Index: 3, Entries: [E3, E4]) and (Index: 5, Entries: [E5]). Second message will be send before ack for first one is obtained. In case in the picture, follower gets first message, checks if it can append it by using Index from request (check is performed in (raft).handleAppendEntries > (raftLog).maybeAppend > (raftLog).matchTerm > (raftLog).term), appends entries to it's log and sends ack. Later on, follower gets 2nd request and does the same for it (checks if it can append it and sends ack).

    Fact that follower checks if it can append entries before sending ack is important here. Once leader get ack for any message it is sure that all entries up to Index + len(Entries) are populated in follower's log (otherwise this message would be rejected instead of acked). Thanks to that, it is not important if first ack is delayed (or even lost).

    1. How etcd retry when response packet(e.g. resp of {term:2,index:3}) loss happen? I can't find any code snippet to handle this in the etcd project.

    I'll focus on etcd now as in raft paper it is described as "the leader retries AppendEntries RPCs indefinitely", which is rather non constructive. Every short interval, leader sends MsgHeartbeat to the follower and latter responds with MsgHeartbeatResp. As part of MsgHeartbeatResp handling, leader does following

    if pr.Match < r.raftLog.lastIndex() {
        r.sendAppend(m.From)
    }
    

    Which should be read as: "If there is any entry that is not present on the follower, send him first missing entry". This can be seen as retry mechanism as pr.Match will not increase without ack from follower.