database, cassandra, nosql, consistency, eventual-consistency

How does Cassandra handle write timestamp conflicts between QUORUM reads?


In the incredibly unlikely event that 2 QUORUM writes happen in parallel to the same row, and result in 2 partition replicas being inconsistent with the same timestamp:

When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?

The next question is: how does the cluster reach consistency again, given that the conflicting data has the same timestamp?

I understand this situation is highly improbable, but my guess is it is still possible.

Example diagram: (image not included)


Solution

  • Here is what I got from DataStax support:

    Definitely a possible situation to consider. Cassandra/Astra handles this scenario with the following precedence rules so that results to the client are always consistent:

      • Timestamps are compared and the latest timestamp always wins.
      • If the data being read has the same timestamp, deletes have priority over inserts/updates.
      • In the event there is still a tie to break, Cassandra/Astra chooses the value for the column that is lexically larger.

    While these are certainly a bit arbitrary, Cassandra/Astra cannot know which value is supposed to take priority, and these rules do function to always give the exact same results to all clients when a tie happens.

    When a CL=QUORUM READ happens in a 3 node cluster, and the 2 nodes in the READ report different data with the same timestamp, what will the READ decide is the actual record? Or will it error?

    Cassandra/Astra would handle this for you behind the scenes while traversing the read path. If there is a discrepancy between the data being returned by the two replicas, the data would be compared and synced amongst those two nodes involved in the read prior to sending the data back to the client.

    So with regard to your diagram, with W1 and W2 both taking place at t = 1, the data coming back to the client would be data = 2 because 2 > 1. In addition, Node 1 would now have the missing data = 2 at t = 1 record. Node 2 would still only have data = 1 at t = 1 because it was not involved in that read.
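
The precedence rules and the read described above can be illustrated with a small simulation. This is only a sketch of the behaviour as the answer describes it, not Cassandra's actual implementation: `Cell`, `reconcile`, and `quorum_read` are hypothetical names, a tombstone is modelled as `None`, and the starting state of the three nodes is an assumption inferred from the answer, since the original diagram is not available.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cell:
    timestamp: int
    value: Optional[bytes]  # None stands in for a tombstone (a delete)


def reconcile(a: Cell, b: Cell) -> Cell:
    """Pick a winner using the three precedence rules quoted above."""
    # Rule 1: the latest timestamp wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Rule 2: on equal timestamps, a delete beats an insert/update.
    if a.value is None or b.value is None:
        return a if a.value is None else b
    # Rule 3: still tied, so the lexically larger value wins.
    return a if a.value >= b.value else b


def quorum_read(replicas: Dict[str, Cell], contacted: List[str]) -> Cell:
    """Read from the contacted replicas, sync them, and return the winner."""
    winner = replicas[contacted[0]]
    for name in contacted[1:]:
        winner = reconcile(winner, replicas[name])
    # Only the replicas that took part in the read are synced.
    for name in contacted:
        replicas[name] = winner
    return winner


# Assumed starting state, inferred from the answer (the diagram is missing).
replicas = {
    "node1": Cell(timestamp=1, value=b"1"),
    "node2": Cell(timestamp=1, value=b"1"),
    "node3": Cell(timestamp=1, value=b"2"),
}

result = quorum_read(replicas, ["node1", "node3"])
print(result.value)             # b'2'  (2 > 1 on the tie-break)
print(replicas["node1"].value)  # b'2'  (synced during the read)
print(replicas["node2"].value)  # b'1'  (not involved in the read)
```

In this model, node2 stays stale until it takes part in a later read alongside a replica that already holds the winning value, at which point the same rules would pick data = 2 again.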