Search code examples
cassandrahectorconsistency

Data in Cassandra not consistent even with Quorum configuration


I encountered a consistency problem using Hector and Cassandra when we have Quorum for both read and write.

I use MultigetSubSliceQuery to query rows from super column limit size 100, and then read it, then delete it. And start another around.

I found that the row which should be deleted by my prior query is still shown from next query.

And also from a normal Column Family, I updated the value of one column from status='FALSE' to status='TRUE', and the next time I queried it, the status was still 'FALSE'.

More detail:

  1. It has not happened not every time (1/10,000)
  2. The time between the two queries is around 500 ms (but we found one pair of queries in which 2 seconds had elapsed between them, still indicating a consistency problem)
  3. We use ntp as our cluster time synchronization solution.
  4. We have 6 nodes, and replication factor is 3

I understand that Cassandra is supposed to be "eventually consistent", and that read may not happen before write inside Cassandra. But for two seconds?! And if so, isn't it then meaningless to have Quorum or other consistency level configurations?

So first of all, is it the correct behavior of Cassandra, and if not, what data we need to analyze for further investment?


Solution

  • After check the source code with the system log, I found the root cause of the inconsistency. Three factors cause the problem:

    • Create and update same record from different nodes
    • Local system time is not synchronized accurately enough (although we use NTP)
    • Consistency level is QUORUM

    Here is the problem, take following as the event sequence

     seqID   NodeA         NodeB          NodeC
     1.      New(.050)     New(.050)      New(.050)
     2.      Delete(.030)  Delete(.030)
    

    First Create request come from Node C with local time stamp 00:00:00.050, assume requests first record in Node A and Node B, then later synchronized with Node C.

    Then Delete request come from Node A with local time stamp 00:00:00.030, and record in node A and Node B.

    When read request come, Cassandra will do version conflict merge, but the merge only depend on time stamp, so although Delete happened after Create, but the merge final result is "New" which has latest time stamp due to local time synchronization issue.