I'm a newbie in etcd and have some confusion about log replication:
How does etcd handle such a case? It seems like it just ignores the log {term:2,index:3} and commits {term:2,index:4} directly.
func (pr *Progress) MaybeUpdate(n uint64) bool {
	var updated bool
	if pr.Match < n {
		pr.Match = n
		updated = true
		pr.ProbeAcked()
	}
	pr.Next = max(pr.Next, n+1)
	return updated
}
The questions you asked are more raft- than etcd-related (etcd implements raft, so they are still relevant though). To get a high-level understanding of the raft algorithm I highly recommend you check out the raft webpage and the raft paper (it's really nicely written!). I believe that section 5.3 "Log replication" will be helpful.
First let's lay some foundation: the leader keeps track of matching entries for every follower. It keeps this information in nextIndex[] and matchIndex[] in the paper (check Fig. 2) and in ProgressMap in etcd.
// ProgressMap is a map of *Progress.
type ProgressMap map[uint64]*Progress

type Progress struct {
	Match, Next uint64
	...
}
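To make the role of ProgressMap concrete, here is a minimal sketch (not etcd's actual code; the function name committedIndex is made up) of how a leader can derive the commit index from the Match values it tracks. It implements the rule from the paper: an index is committed once a majority of matchIndex[i] has reached it.

```go
package main

import (
	"fmt"
	"sort"
)

// Progress mirrors the Match/Next fields quoted above.
type Progress struct {
	Match, Next uint64
}

type ProgressMap map[uint64]*Progress

// committedIndex returns the highest index replicated on a quorum:
// sort the Match values ascending and take the median. With n members,
// position n/2 (0-based) holds the largest value that at least a
// quorum (n/2 + 1) of members has reached.
func committedIndex(prs ProgressMap) uint64 {
	matches := make([]uint64, 0, len(prs))
	for _, pr := range prs {
		matches = append(matches, pr.Match)
	}
	sort.Slice(matches, func(i, j int) bool { return matches[i] < matches[j] })
	return matches[len(matches)/2]
}

func main() {
	prs := ProgressMap{
		1: {Match: 4, Next: 5}, // the leader itself
		2: {Match: 4, Next: 5},
		3: {Match: 2, Next: 3}, // a lagging follower
	}
	fmt.Println(committedIndex(prs)) // 4: present on 2 of 3 members
}
```

etcd's real quorum logic is more general (it supports joint configurations), but the median-of-Match idea is the same.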
Now let's jump to your questions.
- For example, the leader sends out {term:2,index:3} and then {term:2,index:4}, and the majority respond in order too. But due to network delay, the leader receives the responses out of order, getting the response for {term:2,index:4} first. How does etcd handle such a case? It seems like it just ignores the log {term:2,index:3} and commits {term:2,index:4} directly.
Here everything depends on the state of the follower (from the leader's perspective). Let's dive into StateProbe and StateReplicate.

In StateProbe the leader tries to figure out which entries it should send to the follower. It sends one message at a time and waits for the response (which might be a reject response, in which case the leader has to decrease the Next tracked for this follower and retry). In this state, sending 2 different MsgApp messages to the same follower is not possible.
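The probe-and-back-off behavior can be sketched like this. This is a simplified model, not etcd's actual code (the real logic lives in Progress.MaybeDecrTo and also uses a hint from the follower to jump back faster); the name onReject is made up.

```go
package main

import "fmt"

// progress is a toy version of the leader's per-follower bookkeeping.
type progress struct{ Match, Next uint64 }

// onReject handles a reject of the probe whose last sent index was
// `rejected`. It returns false for stale rejects, which are ignored.
func (pr *progress) onReject(rejected uint64) bool {
	if rejected != pr.Next-1 {
		return false // reject for an older probe: ignore it
	}
	pr.Next-- // back off so the next probe tries an earlier entry
	if pr.Next < pr.Match+1 {
		pr.Next = pr.Match + 1 // never probe below what already matched
	}
	return true
}

func main() {
	pr := &progress{Match: 0, Next: 5}
	fmt.Println(pr.onReject(4), pr.Next) // true 4: retry one entry back
	fmt.Println(pr.onReject(4), pr.Next) // false 4: stale reject, no change
}
```

The leader keeps probing backwards like this until it finds the last index on which it and the follower agree; at that point the follower's state switches to StateReplicate.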
In StateReplicate the leader assumes that the network is stable and sends (potentially) many MsgApp messages. Let's work through an example.

Match := 2, Next := 3

Follower log: [E1, E2] (E stands for "entry")

Leader log: [E1, E2]

In this state the leader gets a put request for entries E3, E4 and E5. Let's assume that the max batch size is 2, so all the new entries can't be sent in a single message. The leader will send 2 messages: (Index: 3, Entries: [E3, E4]) and (Index: 5, Entries: [E5]).
The second message will be sent before the ack for the first one is obtained. In this case, the follower gets the first message, checks if it can append it by using the Index from the request (the check is performed in (raft).handleAppendEntries > (raftLog).maybeAppend > (raftLog).matchTerm > (raftLog).term), appends the entries to its log and sends an ack. Later on, the follower gets the 2nd request and does the same for it (checks if it can append it and sends an ack).
The fact that the follower checks whether it can append the entries before sending the ack is important here. Once the leader gets an ack for any message, it is sure that all entries up to Index + len(Entries) are populated in the follower's log (otherwise this message would have been rejected instead of acked). Thanks to that, it does not matter if the first ack is delayed (or even lost).
- How does etcd retry when response packet loss happens (e.g. the response for {term:2,index:3})? I can't find any code snippet handling this in the etcd project.
I'll focus on etcd now, as the raft paper only says "the leader retries AppendEntries RPCs indefinitely", which is rather non-constructive. Every short interval, the leader sends MsgHeartbeat to the follower, and the latter responds with MsgHeartbeatResp. As part of MsgHeartbeatResp handling, the leader does the following:
if pr.Match < r.raftLog.lastIndex() {
	r.sendAppend(m.From)
}
This should be read as: "If there is any entry that is not present on the follower, send it the first missing entries". This can be seen as a retry mechanism, as pr.Match will not increase without an ack from the follower.
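Putting it together, here is a minimal model of that retry path (the names leader, onHeartbeatResp and sendAppend are illustrative, not etcd's actual API). If the ack for some entries was lost, Match never advanced, so the next heartbeat response triggers a resend.

```go
package main

import "fmt"

type progress struct{ Match, Next uint64 }

type leader struct {
	log  []string             // entry payloads, 1-based indexing
	prs  map[uint64]*progress // per-follower progress
	sent []uint64             // followers we resent entries to (for illustration)
}

func (ld *leader) lastIndex() uint64 { return uint64(len(ld.log)) }

// sendAppend resends everything from Next onward to one follower.
func (ld *leader) sendAppend(to uint64) {
	pr := ld.prs[to]
	ld.sent = append(ld.sent, to)
	fmt.Printf("MsgApp to %d: entries %d..%d\n", to, pr.Next, ld.lastIndex())
}

// onHeartbeatResp mirrors the quoted condition: resend only if the
// follower is known to be missing entries.
func (ld *leader) onHeartbeatResp(from uint64) {
	if pr := ld.prs[from]; pr.Match < ld.lastIndex() {
		ld.sendAppend(from)
	}
}

func main() {
	ld := &leader{
		log: []string{"E1", "E2", "E3", "E4"},
		prs: map[uint64]*progress{
			2: {Match: 4, Next: 5}, // fully caught up: no resend
			3: {Match: 2, Next: 3}, // ack for E3, E4 was lost
		},
	}
	ld.onHeartbeatResp(2) // prints nothing
	ld.onHeartbeatResp(3) // prints "MsgApp to 3: entries 3..4"
}
```

Since heartbeats repeat every interval, this loop keeps retrying until the follower's ack finally gets through and Match catches up, which is exactly the "retries indefinitely" behavior from the paper.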