
CockroachDB read transactions


I've been reading about the read-only lock-free transactions as implemented in Google Spanner and CockroachDB. Both claim to be implemented in a lock-free manner by making use of system clocks. Before getting to the question, here is my understanding (please skip the following section if you are aware of the machinery in both systems, or just in CockroachDB):

  • Spanner's approach is simpler -- before committing a write transaction, Spanner picks the max timestamp across all involved shards as the commit timestamp, then adds a wait, called commit wait, for the max clock error before returning from its write transaction. This means that all causally dependent transactions (both reads and writes) will have a timestamp higher than the commit timestamp of the previous write. For read transactions, we pick the latest timestamp on the serving node. For example, if a write was committed at timestamp 5, and the max clock error was 2, future writes and read-only transactions will have a timestamp of at least 7.
  • CockroachDB, on the other hand, does something more complicated. On writes, it picks the highest timestamp among all the involved shards, but does not wait. On reads, it assigns a preliminary read timestamp as the current timestamp on the serving node, then proceeds optimistically by reading across all shards and restarting the read transaction if any key on any shard reports a write timestamp that might imply uncertainty about whether the write causally preceded the read transaction. It assumes that keys with write timestamps less than the timestamp of the read transaction either appeared before the read transaction or were concurrent with it. The uncertainty machinery kicks in on timestamps higher than the read transaction's timestamp. For example, if there was a write committed at timestamp 8, and a read transaction was assigned timestamp 7, we are unsure about whether that write came before the read or after, so we restart the read transaction with a read timestamp of 8.
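To make the contrast concrete, here is a toy Go sketch of the two rules described above. The function names, the integer clock, and the `maxClockError` constant are all invented for illustration; neither system's real code looks like this.

```go
package main

import "fmt"

const maxClockError = 2 // assumed maximum clock offset between nodes

// commitWaitTimestamp sketches Spanner's commit wait: the commit timestamp
// is the max across involved shards, and the transaction only returns once
// clocks everywhere have passed commitTs + maxClockError.
func commitWaitTimestamp(shardTimestamps []int) (commitTs, returnAfter int) {
	for _, ts := range shardTimestamps {
		if ts > commitTs {
			commitTs = ts
		}
	}
	return commitTs, commitTs + maxClockError
}

// readWithUncertainty sketches CockroachDB's restart rule: a write whose
// timestamp is above the read timestamp but within the uncertainty window
// forces the read to restart at that higher timestamp.
func readWithUncertainty(readTs, writeTs int) (newReadTs int, restart bool) {
	if writeTs > readTs && writeTs <= readTs+maxClockError {
		return writeTs, true // uncertain: the write may causally precede us
	}
	return readTs, false // writeTs <= readTs: visible; far future: ignored
}

func main() {
	commitTs, returnAfter := commitWaitTimestamp([]int{3, 5})
	fmt.Println(commitTs, returnAfter) // 5 7

	newTs, restart := readWithUncertainty(7, 8)
	fmt.Println(newTs, restart) // 8 true
}
```

The example mirrors the numbers in the text: a write at timestamp 5 with clock error 2 cannot return before time 7, and a read at 7 that sees a write at 8 restarts at 8.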

Relevant sources - https://www.cockroachlabs.com/blog/living-without-atomic-clocks/ and https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf


Given this implementation, does CockroachDB guarantee that the following two transactions will not see a violation of serializability?

  1. A user blocks another user, then posts a message that they don't want the blocked user to see as one write transaction.
  2. The blocked user loads their friends list and their posts as one read transaction.

As an example, suppose the friends list and posts live on different shards, and the following ordering happens (assuming a max clock error of 2):

  1. The initial posts and friends list was committed at timestamp 5.
  2. A read transaction starts at timestamp 7, it reads the friends list, which it sees as being committed at timestamp 5.
  3. Then the write transaction for blocking the friend and making a post gets committed at 6.
  4. The read transaction reads the posts, which it sees as being committed at timestamp 6.

Now, the transactions violate serializability, because the read transaction observed the old friends list but the new post -- it saw only part of the write transaction's effects.
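The apparent anomaly can be replayed with a toy MVCC model in Go (invented types and values, purely illustrative): a reader at t=7 that applies only the "visible if committed at ts <= readTs" rule, with no other machinery, ends up with an inconsistent snapshot.

```go
package main

import "fmt"

// A version is a value committed at a timestamp.
type version struct {
	ts    int
	value string
}

// latestVisible returns the newest version with ts <= readTs -- a naive
// MVCC read with no timestamp cache and no restart for writes below readTs.
func latestVisible(versions []version, readTs int) string {
	best := version{ts: -1, value: "<none>"}
	for _, v := range versions {
		if v.ts <= readTs && v.ts > best.ts {
			best = v
		}
	}
	return best.value
}

func main() {
	readTs := 7

	// Step 1: initial state committed at t=5 on both shards.
	friends := []version{{5, "friends: [blockedUser]"}}
	posts := []version{{5, "posts: []"}}

	// Step 2: the read transaction reads the friends list at t=7.
	seenFriends := latestVisible(friends, readTs)

	// Step 3: the write transaction commits at t=6 on both shards.
	friends = append(friends, version{6, "friends: []"})
	posts = append(posts, version{6, "posts: [secret message]"})

	// Step 4: the read transaction reads the posts at t=7.
	seenPosts := latestVisible(posts, readTs)

	// The snapshot mixes the old friends list with the new post.
	fmt.Println(seenFriends) // friends: [blockedUser]
	fmt.Println(seenPosts)   // posts: [secret message]
}
```

This is exactly the interleaving from steps 1-4: without some extra mechanism, the blocked user would see the secret post.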

What am I missing?


Solution

  • CockroachDB handles this with a mechanism called the timestamp cache (which is an unfortunate name; it's not much of a cache).

    In this example, at step two, when the transaction reads the friends list at timestamp 7, the shard that holds the friends list remembers that it has served a read for this data at t=7 (the timestamp requested by the reading transaction, not the last-modified timestamp of the data), and it can no longer allow any writes to commit with lower timestamps.

    Then in step three, when the writing transaction attempts to write and commit at t=6, this conflict is detected and the writing transaction's timestamp gets pushed to t=8 or higher. Then that transaction must refresh its reads to see if it can commit as-is at t=8. If not, an error may be returned and the transaction must be retried from the beginning.

    In step four, the reading transaction completes, seeing a consistent snapshot of the data as it existed at t=7, while both parts of the writing transaction are "in the future" at t=8.
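The mechanism above can be sketched as a toy timestamp cache in Go. This is a deliberately simplified model (a per-key map with invented method names), not CockroachDB's actual implementation, which tracks key ranges and handles much more.

```go
package main

import "fmt"

// tsCache records, per key, the highest timestamp at which a read has been
// served -- a toy model of the timestamp cache described above.
type tsCache map[string]int

// recordRead remembers that key was read at ts.
func (c tsCache) recordRead(key string, ts int) {
	if ts > c[key] {
		c[key] = ts
	}
}

// pushIfConflicting returns the timestamp a write must use: if a read has
// already been served at or above the proposed write timestamp, the write
// is pushed above the cached read timestamp (and the writer must then
// refresh its reads before it can commit).
func (c tsCache) pushIfConflicting(key string, writeTs int) int {
	if readTs, ok := c[key]; ok && writeTs <= readTs {
		return readTs + 1
	}
	return writeTs
}

func main() {
	cache := tsCache{}

	// Step 2: the friends-list shard serves a read at t=7.
	cache.recordRead("friends", 7)

	// Step 3: the blocking transaction tries to commit at t=6 on the
	// same shard; the timestamp cache pushes it above t=7.
	pushed := cache.pushIfConflicting("friends", 6)
	fmt.Println(pushed) // 8
}
```

With the write pushed to t=8, the read at t=7 sees neither half of the write transaction, so its snapshot is consistent.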