Many distributed systems (e.g. databases) say they can provide strong consistency. For example, with N replicas of the data, W nodes required to acknowledge a write, and R replicas required to respond to a read, the Cassandra documentation says that as long as R + W > N you will get strong consistency. Intuitively, that makes sense. But then I started thinking about this at the individual message level and I can't actually understand how it could be achieved.
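The overlap argument behind R + W > N can be checked exhaustively for small N. This is a toy sketch (not Cassandra code) showing that whenever R + W > N, every possible set of W write replicas intersects every possible set of R read replicas, so a read must contact at least one replica that saw the write:

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """Check that every W-sized write set intersects every
    R-sized read set out of n replicas."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )

print(quorums_always_overlap(3, 1, 3))  # R + W = 4 > 3 -> True
print(quorums_always_overlap(3, 1, 2))  # R + W = 3 = N -> False
```

Brute force is fine here because the point is the set-intersection property, not efficiency.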
To be specific, let's assume I have a Cassandra cluster with a replication factor of 3. For simplicity, let's assume only a single data partition, so we have exactly 3 nodes in the system: A, B, and C. A client attempts to write some data, x = 11, with a write consistency of W = 3; that is, the write is only considered complete if all replicas acknowledge it. So the client sends the write request to A, which then forwards it to B and C. Let's assume B ACKs the write but C does not. The write should then fail. Another client then does a read with R = 1 and happens to talk to B. Since R + W = 1 + 3 = 4 > 3, this should be a strongly consistent read. However,
B has already ACK'd the write and there is thus at least some window of time where
B will return
x = 11 if asked (it may only be a window as
A might tell
B "never mind, the write failed"). If the client never retries its write, we have now served wholly incorrect data to a client, and it doesn't seem like we can consider this strong consistency.
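The scenario above can be modeled with a toy in-memory sketch (the node names, `write`, and `read` helpers are illustrative, not Cassandra's API). The W = 3 write "fails" because C never ACKs, yet B has already applied it and will happily serve it to an R = 1 read:

```python
# Three replicas, each just a dict of key -> value.
replicas = {"A": {}, "B": {}, "C": {}}

def write(key, value, acks_from):
    """Apply the write on every replica that ACKs it; report
    success only if all 3 replicas (W = 3) acknowledged."""
    for name in acks_from:
        replicas[name][key] = value
    return len(acks_from) == 3

def read(key, node):
    """R = 1 read: ask a single replica."""
    return replicas[node].get(key)

ok = write("x", 11, acks_from=["A", "B"])  # C never ACKs
print(ok)              # False: the write "failed"
print(read("x", "B"))  # 11: B serves the failed write anyway
```

The failed write is visible at B until something (a retry, or a "never mind" from A) changes B's state.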
We can start to think about schemes to fix this. For example, maybe the protocol is that the nodes each ACK the message but won't return it until A reaches out to them again and tells them to commit (i.e. a two-phase commit). But again we run into trouble: now we can have B and C initially ACK, so A tells them to commit, but C fails to receive that message. As a result, a read from C would fail to return x = 11 even though the write appears to have succeeded. Attempts to fix this via additional rounds of messaging (e.g. each node has to ACK the commit phase) also inevitably run into issues, as the Two Generals Problem shows.
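The two-phase variant can be sketched the same way (again a hypothetical model, not a real 2PC implementation): every replica ACKs the prepare, so the coordinator declares success, but the commit message is lost on its way to C, leaving C unable to serve a write that "succeeded":

```python
class Replica:
    def __init__(self):
        self.pending = {}  # prepared but not yet committed
        self.data = {}     # committed, readable values

    def prepare(self, key, value):
        self.pending[key] = value
        return True  # ACK the prepare

    def commit(self, key):
        self.data[key] = self.pending.pop(key)

a, b, c = Replica(), Replica(), Replica()

# Phase 1: all three replicas ACK, so the write "succeeds".
all(r.prepare("x", 11) for r in (a, b, c))

# Phase 2: the commit reaches A and B but is lost en route to C.
a.commit("x")
b.commit("x")

print(b.data.get("x"))  # 11
print(c.data.get("x"))  # None: C can't serve the "successful" write
```

Adding more rounds just moves the problem: some final message can always be the one that gets lost.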
There's clearly something wrong with my reasoning here; Cassandra does provide strong consistency when used properly. My question is, at the node-to-node protocol level, how do they do it?
I think the answer is that "strong consistency" here is something akin to read uncommitted, meaning that dirty reads, as in my initial example, are in fact allowed and do happen. Indeed, I found this in the Cassandra documentation:
If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
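This also suggests how the lingering replicated write gets resolved: Cassandra reconciles divergent replicas at read time by comparing cell timestamps and keeping the newest value (last-write-wins). A rough sketch of that reconciliation, with the exact structures being my own illustration rather than Cassandra's internals:

```python
# Each replica stores key -> (value, timestamp). A read that contacts
# several replicas returns the version with the newest timestamp,
# which is roughly how divergent replicas are reconciled.
def quorum_read(key, contacted_replicas):
    versions = [r[key] for r in contacted_replicas if key in r]
    if not versions:
        return None
    value, _ts = max(versions, key=lambda v: v[1])
    return value

b = {"x": (11, 100)}  # B kept the "failed" write, timestamp 100
c = {"x": (7, 50)}    # C still holds the older value

print(quorum_read("x", [b, c]))  # 11: the newer (failed!) write wins
```

Note that under last-write-wins the "failed" write can still end up as the surviving value, which is consistent with the documentation's statement that it is never rolled back.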