Search code examples
apache-zookeeperetcdpaxosconsensusraft

Is a replication log necessary to achieve linearizability in distributed store


The Raft algorithm used by etcd and ZAB algorithm by Zookeeper are both using replication log to update a state machine.

I was wondering if it's possible to design a similar system by simply using leader election and versioned values. And why those system decided to use a replication log.

I my example if we have the following setup

  • machine A (Leader), contain version 1
  • machine B (Follower), contain version 1
  • machine C (Follower), contain version 1

And the write would go like this:

  1. Machine A receive Write request and store pending write V2
  2. Machine A send prepare request to Machine B and Machine C
  3. Followers (Machine B and Machine C) send Acknowledge to leader (Machine A)
  4. After Leader (machine A) receive Acknowledge from quorum of machine, it know V2 is now commited and send success response to client
  5. Leader (machine a) send finalize request to Follower (machine A and Machine B) to inform them that V2 is commited and V1 could be discarded.

For this system to work, On leader change after acquiring leader Lease the leader machine have to get the latest data version by reading from a quorum of node before accepting Request.


Solution

  • The raft algorithm in ETCD and ZAB algorithm in Zookeeper are both using replication log to update a state machine. I was wondering if it's possible to design a similar system by simply using leader election and versioned values.

    Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved in the Paxos Made Simple paper by Leslie Lamport (1998). He described two algorithms: Single Decree Paxos to to build a distributed linearizable write-once register and Multi-Paxos to make a distributed state machine on top of append only log (an ordered array of write-once registers).

    Append only logs is much more powerful abstraction than write-once registers therefore it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change; what is vital for multiple tasks: if you can't replace failed nodes then eventually your cluster becomes unavailable.

    Yet Vertical Paxos is a good paper, it was much easier for me to understand the Raft's idea of cluster membership via the joint consensus, so I wrote a post on how to adapt the Raft's way for Single Decree Paxos.

    With time the "write-once" nature of the Single Decree Paxos was also resolved turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for the many use cases. In the wild I saw that approach in the Treode database. If you got interested I blogged about this improved SDP in the How Paxos Works post.

    So now when we have an alternative to logs it makes sense to consider it because log based replication is complex and has intrinsic limitations:

    • with logs you need to care about log compaction and garbage collection
    • size of the log is limited by the size of one node
    • protocols for splitting a log and migration to a new cluster are not well-known

    And why those system decided to use a replication log.

    The log-based approach is older that the alternative, so it has more time to gain popularity.

    About your example

    It's hard to evaluate it, because you didn't describe how the leader election happens and the conflicts between leaders are resolved, what is the strategy to handle failures and how to change membership of the cluster.

    I believe if you describe them carefully you'll get a variant of Paxos.