I recently read this great article from the CockroachDB blog, which talks about how they maintain consistency in a similar way to Spanner, but without atomic clocks. Here is the relevant part of the post:
When CockroachDB starts a transaction, it chooses a provisional commit timestamp based on the current node's wall time. It also establishes an upper bound on the selected wall time by adding the maximum clock offset for the cluster [commit timestamp, commit timestamp + maximum clock offset]. This time interval represents the window of uncertainty.
...
It's only when a value is observed to be within the uncertainty interval that CockroachDB-specific machinery kicks in. The central issue here is that given the clock offsets, we can't say for certain whether the encountered value was committed before our transaction started. In such cases, we simply make it so by performing an uncertainty restart, bumping the provisional commit timestamp just above the timestamp encountered. Crucially, the upper bound of the uncertainty interval does not change on restart, so the window of uncertainty shrinks. Transactions reading constantly updated data from many nodes may be forced to restart multiple times, though never for longer than the uncertainty interval, nor more than once per node.
Specifically, I don't understand why the upper bound of the window of uncertainty doesn't also have to be bumped during an uncertainty restart. Here is an example to illustrate my confusion:
Imagine we have two writes, A and B, on different nodes. Write A has a timestamp of 3 and B a timestamp of 5. Assuming a maximum clock offset of 3 units of time, if we start a transaction on a node whose clock currently reads 1, we will construct an uncertainty window of [1, 4]. When we come across write A, we will perform an uncertainty restart to include it and shrink the uncertainty window to (3, 4]. When we come across write B, we will ignore it, as it lies above the upper bound of the uncertainty window. But since our maximum clock offset is 3 units and the timestamps of A and B are less than 3 units apart, B could have happened before A. Yet we have included A and not B in our transaction, so we don't have consistency.
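In code, this is roughly how I understand the restart rule (a toy sketch of my own; the names and types are mine, not CockroachDB's):

```go
package main

import "fmt"

// txn is a toy model of a reading transaction's uncertainty window.
// readTS is the provisional commit timestamp; maxTS is fixed at
// readTS + maximum clock offset when the transaction starts.
type txn struct {
	readTS int
	maxTS  int
}

// observe reports whether encountering a value at ts forces an
// uncertainty restart, bumping readTS but leaving maxTS unchanged.
func (t *txn) observe(ts int) bool {
	if ts <= t.readTS {
		return false // definitely committed before us: just read it
	}
	if ts > t.maxTS {
		return false // definitely in our future: ignore it
	}
	t.readTS = ts // uncertain: restart just above ts
	return true
}

func main() {
	const maxOffset = 3
	t := &txn{readTS: 1, maxTS: 1 + maxOffset} // window [1, 4]
	fmt.Println(t.observe(3))                  // true: restart, window is now (3, 4]
	fmt.Println(t.observe(5))                  // false: write B lies above maxTS and is ignored
}
```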
Thanks in advance for pointing out what I am missing!
Great question. This is pretty subtle; I'll try to explain what's going on.
First, let's look at the transactions involved. We have two writing transactions, A (ts=3) and B (ts=5). If there were any overlap in the keys touched by these two transactions, they would not be able to commit "out of order", and we could be certain that transaction A happened before transaction B. But if there were no overlapping keys, then there is no point of coordination that would guarantee this order except the clock, and since the timestamps are close enough together, it's ambiguous which one "happened first".
What we'd like to do in the presence of this ambiguity is to assign some order anyway: to deem the transactions to have happened in the order implied by their timestamps. This is safe to do as long as we know that no one observed a state in which B was written but not A. This is enforced by the mechanisms inside CockroachDB that govern read/write interactions (mainly the poorly-named TimestampCache). If transaction B were followed by a transaction C at timestamp 6 which read keys A and B, transaction A would no longer be allowed to commit at timestamp 3 (it would be pushed to timestamp 7).
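To make that concrete, here is a toy sketch of the read-side bookkeeping (the real TimestampCache is far more elaborate; the names and simplifications here are mine):

```go
package main

import "fmt"

// tsCache is a toy stand-in for the timestamp cache: the highest
// timestamp at which each key has been read.
type tsCache map[string]int

func (c tsCache) recordRead(key string, ts int) {
	if ts > c[key] {
		c[key] = ts
	}
}

// pushIfNeeded returns the timestamp at which a write to key may
// commit: never at or below a timestamp someone has already read,
// so no one can have observed a state with B but not A.
func (c tsCache) pushIfNeeded(key string, writeTS int) int {
	if readTS, ok := c[key]; ok && writeTS <= readTS {
		return readTS + 1 // pushed above the read
	}
	return writeTS
}

func main() {
	cache := tsCache{}

	// Transaction C reads keys A and B at timestamp 6.
	cache.recordRead("a", 6)
	cache.recordRead("b", 6)

	// Transaction A now tries to commit its write to key "a" at timestamp 3.
	fmt.Println(cache.pushIfNeeded("a", 3)) // 7: pushed above C's read
}
```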
This works as long as the database can see the relationships between reads and writes. There are situations, however, in which we can't see this: if write B has some out-of-band causal dependency on write A but no transaction ever touched both keys. This is the anomaly we call "causal reverse", and it is a difference between CockroachDB's serializable isolation and Spanner's linearizable (or strict serializable) isolation level.
Updating the upper bound of the uncertainty interval would be one way to avoid the causal reverse anomaly, although it would have severe practical drawbacks: you might have to keep restarting over and over until you find a gap of at least your maximum clock offset between writes on all the keys your transaction touches, and that may never succeed.
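To illustrate, here is a toy simulation (mine, not CockroachDB code) of a reader that starts at timestamp 1 with a maximum offset of 3 and scans a hot key written at timestamps 2, 4, 6, ..., 20. With the fixed upper bound it restarts twice and is done; if each restart also re-extended the upper bound, it would restart on every single write, and against a sustained write rate like this it would never stop:

```go
package main

import "fmt"

// restarts counts uncertainty restarts while scanning writes at
// ts = 2, 4, ..., 20. extendOnRestart enables the hypothetical
// policy of also bumping the upper bound on each restart.
func restarts(extendOnRestart bool) int {
	const maxOffset = 3
	readTS, maxTS := 1, 1+maxOffset
	n := 0
	for writeTS := 2; writeTS <= 20; writeTS += 2 {
		if writeTS > readTS && writeTS <= maxTS {
			n++
			readTS = writeTS
			if extendOnRestart {
				maxTS = writeTS + maxOffset
			}
		}
	}
	return n
}

func main() {
	fmt.Println("fixed upper bound:   ", restarts(false)) // 2
	fmt.Println("extended upper bound:", restarts(true))  // 10 (every write)
}
```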