amazon-dynamodb, distributed-system, key-value-store, eventual-consistency

Distributed eventual consistency Key Value Store


I find it difficult to convince myself of the advantage of using a complex design like DynamoDB over a simple duplication strategy.

Let's say we want to build a distributed key/value data store over 5 servers (each server holds exactly the same replica).

An eventually consistent system like DynamoDB typically uses complicated conflict reconciliation, vector clocks, etc. to achieve eventual consistency.

But instead, why couldn't we simply do the following:

  1. For writes, the client issues the write command to all servers, so every server executes the clients' write commands in the same order. The client gets its reply before the servers commit the write.
  2. For reads, the client simply round-robins: only one server at a time handles a read command (the other servers never see it). Yes, a client may observe temporarily stale data, but eventually all replicas will hold the same dataset, which is the same semantics as DynamoDB.
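The two steps above can be sketched as a small client class. This is a toy model, not a real networked store: plain dicts stand in for remote servers, and all names (`NaiveReplicatedKV`, `write`, `read`) are illustrative.

```python
import itertools

class NaiveReplicatedKV:
    """Sketch of the proposed scheme: writes fan out to every replica,
    reads round-robin across replicas one at a time."""

    def __init__(self, servers):
        self.servers = servers  # dict-like stand-ins for remote servers
        self._rr = itertools.cycle(self.servers)

    def write(self, key, value):
        # Step 1: issue the write to all servers.
        # (Assumes every server is reachable and applies writes in order.)
        for server in self.servers:
            server[key] = value

    def read(self, key):
        # Step 2: round-robin; only one server sees each read,
        # so a lagging replica could return stale data.
        return next(self._rr).get(key)

servers = [dict() for _ in range(5)]
kv = NaiveReplicatedKV(servers)
kv.write("x", 1)
print(kv.read("x"))  # every replica holds the same value
```

Note that the sketch hides exactly the hard part: what `write` should do when one of the five servers is unreachable, which is where the answer below picks up.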

What's the disadvantage of this simple design versus the complicated DynamoDB design?


Solution

  • Your strategy has a few disadvantages, but their exact nature depends on details you haven't covered.

    One obvious example is dealing with network segmentation. That is, when one part of your network becomes segmented (disconnected) from another part.

    In this case, you have a couple of choices about how to react when you try to write some data to the server, and that fails. You might just assume that it worked, and continue as if everything was fine. If you do that, and the server later comes back on line, a read may return stale data.

    To prevent that, you might treat a failed write as a true failure, and refuse to accept the write until/unless all servers confirm the write. This, unfortunately, makes the system as a whole quite fragile--in fact, much more fragile (at least with respect to writing) than if you didn't replicate at all (because if any of the servers go off-line, you can't write any more). It also has one more problem: it limits write throughput to the (current) speed of the slowest server, so even if they're all working, unless they're perfectly balanced (unlikely to happen) you're wasting capacity.
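The fragility claim can be made concrete with a little arithmetic. Assuming each server is independently available with some probability p (0.99 here is just an illustrative figure), a write that requires all five acknowledgements succeeds less often than a write to a single unreplicated server:

```python
# Probability that each server is up, assumed independent (illustrative value).
p = 0.99

single_server = p        # unreplicated: one server must be up
all_must_ack = p ** 5    # require all 5 replicas to confirm the write

print(round(single_server, 4))  # 0.99
print(round(all_must_ack, 4))   # ~0.951: replication made writes LESS available
```

So with strict all-server acknowledgement, adding replicas strictly reduces write availability, exactly as the paragraph above argues.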

    To prevent those problems, many systems (including Paxos, if memory serves) use some sort of "voting" based system. That is, you attempt to write to all the servers. You consider the write complete if and only if the majority of servers confirm that they've received the write. Likewise on a read, you attempt to read from all the servers, and you consider a value properly read if and only if the majority of servers agree on the value.

    This way, up to one fewer than half the servers can be off-line at any given time, and you can still read and write data. Likewise, if you have a few servers that react a little more slowly than the rest, that doesn't slow down operations overall.
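The majority-voting idea can be sketched as follows. This is a simplified model, not Paxos itself: dict-like objects stand in for servers, an unreachable server is modeled as one that raises on access, and the function names (`quorum_write`, `quorum_read`) are made up for illustration.

```python
from collections import Counter

def quorum_write(servers, key, value):
    """Attempt the write on all servers; succeed iff a majority ack."""
    acks = 0
    for server in servers:
        try:
            server[key] = value
            acks += 1
        except Exception:
            pass  # server unreachable; keep going
    return acks > len(servers) // 2

def quorum_read(servers, key):
    """Read from all servers; return the value a majority agree on,
    or None when no majority exists."""
    votes = Counter()
    for server in servers:
        try:
            votes[server.get(key)] += 1
        except Exception:
            pass
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count > len(servers) // 2 else None

class OfflineServer(dict):
    """Stand-in for a segmented/unreachable server."""
    def __setitem__(self, key, value):
        raise ConnectionError("server unreachable")
    def get(self, key):
        raise ConnectionError("server unreachable")

# Two of five servers are down, yet reads and writes still succeed:
servers = [dict(), dict(), dict(), OfflineServer(), OfflineServer()]
print(quorum_write(servers, "x", 1))  # True  (3 of 5 acked)
print(quorum_read(servers, "x"))      # 1     (3 of 5 agree)
```

With five servers, any two can be offline and both operations still complete, matching the "up to one fewer than half" bound above; a real system would also need version numbers on values and read-repair, which this sketch omits.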

    Of course, you need to fill in quite a few details to create a working system--but the fact remains that the basic concept is pretty simple, as outlined above.