Search code examples
synchronizationreplicationapache-zookeeper

Apache ZooKeeper: How do writes work


Apache ZooKeeper is a kind of high available data-store for small objects. A ZooKeeper cluster consists of some nodes which all keep the whole dataset in their memory. The dataset is called "always-consistent", so every node has the same data at every time.

According to the documentation and blog posts, every node in the cluster can answer reads and accept writes.

  • Reads are always answered locally by the node, so no communication with the cluster is involved.
  • Writes are forwarded to a designated "Leader" node, which forwards the write-request to all nodes and waits for their replies. If at least half of the nodes answers, the write is considered successful.

Question: Why is it enough for the leader to wait for half of the nodes to reply? If somebody connects to one of the nodes which didn't receive the update, he gets an outdated result (only local read to local value).


Solution

  • In order to achieve high read-availability, Zookeeper guarantees a weak-consistency over the replicates: a read can always be answered by a client node, and the answer returned may be a stale value (even a new version has been committed through the leader).

    Then it is the users' responsibility to decide whether the answer for a read is "stale-able" or not, since not all applications require the up-to-date information. So the following choices are provided:

    1) If your application does not need up-to-date values for reads, you can get high read-availability by requesting data directly from the client.

    2) If your application requires up-to-date values for reads, you should use the "sync" API before your read request to sync the client-side version with the leader.

    So as a conclusion, Zookeeper provides a customizable consistency guarantee, and users can decide the balance between availability and consistency.

    If you want to know more about the internals of Zookeeper, I recommend this paper: ZooKeeper: Wait-free coordination for Internet-scale systems. The above strategy is described in Section 4.4.