Cassandra not always returning the expected data for the same query in a single-datacenter, 5-node setup

I've run into an issue and I'm not sure why it happens.

I have a Cassandra cluster with a single datacenter, 5 nodes, a replication factor of 3, and consistency level ONE in both my app and cqlsh at the time of testing.

In cqlsh I was running a query similar to:

SELECT * FROM session where id='xxxxxxxxxxxxxxx' and device_id='xxxxxxxxxxxxxxxx';

Seemingly at random, the query sometimes returned my row and other times returned an empty result.

First I checked the status of the cluster and everything looked fine there. All nodes in "UN" state, around 60% ownership per node, 256 tokens each.

Then I ran the getendpoints command like this:

nodetool getendpoints <keyspace> <table> "xxxxxxxxxxxxxxx"

And I saw 3 nodes holding this ID, which looks fine.

I then ran repair on each of the nodes and the issue went away, but I still don't understand what was wrong.

The data had been in the DB for a long time: days, not minutes.

I suspect the underlying problem is still there. What could it be, and how can I easily debug or monitor it?

Thank you for the help

Solution

The issue is that the data is not consistent across replicas. You can tell because the problem went away after you ran repair: before that, your reads at CL ONE were sometimes served by a replica that didn't actually have the row. If you need consistent reads, your query will need a higher CL, such as TWO or LOCAL_QUORUM.
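To make the choice of CL concrete, here is a minimal sketch of the usual overlap rule: a read is only guaranteed to reach a replica that acknowledged the write when the replicas contacted on read plus the replicas contacted on write exceed the replication factor. The helper names `quorum` and `reads_see_writes` are made up for illustration, and the sketch assumes a single datacenter as in the question.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Cassandra: plain arithmetic only.

# Number of replicas that must answer at QUORUM for replication factor $1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Args: replication factor, replicas acked on write, replicas asked on read.
# Prints "yes" when read + write replicas exceed RF, so the two sets must overlap.
reads_see_writes() {
    if [ $(( $2 + $3 )) -gt "$1" ]; then
        echo yes
    else
        echo no
    fi
}

rf=3
q=$(quorum "$rf")
echo "write ONE,    read ONE:    $(reads_see_writes "$rf" 1 1)"
echo "write ONE,    read TWO:    $(reads_see_writes "$rf" 1 2)"
echo "write QUORUM, read QUORUM: $(reads_see_writes "$rf" "$q" "$q")"
```

By this rule, raising only the read CL to TWO while writes stay at ONE makes an empty result less likely but does not strictly rule it out; QUORUM (or LOCAL_QUORUM) on both sides does. In cqlsh, the level for the current session can be changed with the CONSISTENCY command (for example CONSISTENCY LOCAL_QUORUM;) before running the SELECT.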

The reason your data is not consistent is most likely dropped mutations somewhere, caused either by network problems or by overloaded nodes. Either way, that is what the symptoms point to.
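One way to check for dropped mutations is the dropped-message section that nodetool tpstats prints on each node. The sketch below sums the Dropped column for rows whose message type starts with MUTATION. The function name `dropped_mutations` is made up for illustration, and the parsing assumes the type is in the first column and the dropped count in the second; the exact layout and row names differ between Cassandra versions, so check it against your own output first.

```shell
#!/bin/sh
# Hypothetical helper: reads `nodetool tpstats` output on stdin and prints
# the total dropped count for message types beginning with MUTATION.
# Assumes column 1 is the message type and column 2 is the dropped count.
# The match is case-sensitive, so the MutationStage thread-pool row is skipped.
dropped_mutations() {
    awk '$1 ~ /^MUTATION/ { total += $2 } END { print total + 0 }'
}

# Only query the local node when nodetool is actually installed.
if command -v nodetool >/dev/null 2>&1; then
    echo "dropped mutations on this node: $(nodetool tpstats | dropped_mutations)"
fi
```

The counter is per node and is reset when the node restarts, so it would need to be run on every node, ideally on a schedule, to work as monitoring. A value that keeps growing suggests that node is shedding writes and that replicas will drift apart until the next repair.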