Search code examples
computer-sciencedistributeddistributed-computingpaxosconsensus

Using Paxos to synchronize a large file across nodes


I'm trying to use Paxos to maintain consensus between nodes on a file that is around 50MB in size, and constantly being modified at individual nodes. I'm running into issues of practicality. Requirements:

  1. Sync a 50MB+ file across hundreds of nodes
  2. Have changes to this file, which can be made from any node, and aren't likely to directly compete with each other, propagated across the network in a few seconds at most
  3. New nodes that join the network can within a few minutes (<1 hour) build up the entire file by following along with the Paxos messages

The problem I'm facing is that there doesn't seem to be a way to accomplish both goals 2 and 3.

Here are the options I've considered so far:

  • Sync the entire file each round — Completely impractical, Paxos rounds would take minutes
  • Sync only changes to the file — Reasonable for goals 1 and 2, but breaks goal 3, as new nodes would only be able to sync the entire file once every unit of state has been changed
  • Sync changes & a random piece of the file each round — I'm not sure if Paxos allows for this. Nodes would be able to verify the changes against their own (allowing for new changes), and would be able to verify the random piece of the file against said piece of their version, but is this practical?

I'm thinking the third option is best, but I'm not sure if Paxos allows this. The idea would be to limit the data exchanged each round to maybe 1500 bytes, and fill that 1500 bytes with changes to the file primarily. Most rounds, the file would be unchanged, and the rounds in which something changed would most likely be less than 100 bytes of altered state, so the other 1400 bytes could be filled with some piece of the file, which would allow new nodes to build up the entire file over time. Is this practical? Has this problem already been solved?


Solution

  • As peter mentioned in the comments, an eventually-consistent is probably a better fit. Here's one such algorithm, based on the gossip protocol.

    Every node has some version of the file. Every N seconds each node connects to another node and they swap version numbers. If one node is behind the other it downloads the file from the peer.

    This converges remarkably quickly, I think within 10-20 gossip rounds for 1000 nodes.


    Update

    (Introducing raft or a messaging queue into the mix.)

    Raft

    From your comments it appears you've got a key-value store on your hands. You can think of it as a distributed state machine, in which you treat an update to each key as its own command. This is great for a consensus protocol like paxos or raft (I'm favoring raft now-a-days for the number of open source implementations). What's more, these are often implemented to also act like an atomic broadcast system. In short, a few nodes act as the core decision makers and the rest of the nodes listen to the results.

    (Of course, I don't know how your file is being update; i.e. if it is only updated on a master node and the remainders are slaves.)

    One major concern will be the fan-out to 1000s of nodes. For this, you'll probably want a hierarchal fan-out. There are various schemes to help out with this; here are a few ideas. A) Have each node connect to two random peers; and stream from the peer with the shortest path to the master node (this is called the power of two choices); or B) chose the peer with shortest route with some probability p. C) Have each node connect to one random peer and with some probability p, stream from that node else connect to its upstream node instead. These probabilities are meant to make an n-ary tree, which is a good balance between every node connecting to the master node, and every node in a linked list.

    Messaging Queue

    Now, paxos and raft provide some pretty strong guarantees. Specifically for this case, every update will be processed in order--across all keys. If you don't need that guarantee then you can architect a much simpler system.

    Each key update could be broadcast to a distributed messaging queue (like SQS, RabbitMQ, etc.) Version each key update and only apply an update if its greater than the version you have. This presents you with a beautiful and fast eventually consistent system.

    I'd go with this approach above using raft/paxos if the system allows for it.