apache-zookeeper paxos

How does a ZooKeeper cluster of 3 stay active when 1 node is down?


The documentation here says that:

A 3 server ensemble allows for a single server to fail and the service will still be available.

However, for a quorum to be established, there need to be ceil(n/2)+1 nodes.

In the case of 3 nodes, that is:
ceil(3/2)+1 = ceil(1.5)+1 = 3

So if 1 node is down, a quorum cannot be established and ZooKeeper should go down.

Is the above documentation wrong in this case?


Solution

  • The quorum in a three-node cluster is 2, because 2 is a majority of 3. Any two majorities, whether formed at the same time or at different times, must overlap in at least one node, so no majority can be unaware of what another majority has done. That is a fundamental property used by the Paxos algorithm. (ZooKeeper uses ZAB, not Paxos; the point is that safety in a consensus algorithm relies on majorities.) So your calculation should be floor(N/2)+1, which gives a quorum of 2 in a 3-node cluster, 3 in a 5-node cluster, 4 in a 7-node cluster, and so on. With 1 of 3 nodes down, the remaining 2 still form a quorum, so the documentation is correct.
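
The arithmetic above can be checked with a short sketch. This is plain Python, not ZooKeeper code; the helper names `quorum_size`, `tolerated_failures`, and `majorities_always_overlap` are made up for illustration. It computes floor(N/2)+1 and brute-forces the claim that any two majorities share at least one node.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Smallest strict majority of n voting servers: floor(n/2) + 1."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many servers can be down while a quorum is still reachable."""
    return n - quorum_size(n)


def majorities_always_overlap(n: int) -> bool:
    """Brute-force check that every pair of minimal majorities shares a node."""
    quorums = [set(c) for c in combinations(range(n), quorum_size(n))]
    return all(a & b for a in quorums for b in quorums)


# Columns: ensemble size, quorum size, tolerated failures, overlap check.
for n in (3, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n), majorities_always_overlap(n))
```

For N = 3 this gives a quorum of 2 and one tolerated failure, which matches the documentation quoted in the question. The ceil(n/2)+1 formula from the question agrees with floor(N/2)+1 only when n is even; for odd n it is one too large.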