Cassandra data corruption: NULL values appearing on certain columns


I'm running a Cassandra 3.9 cluster, and today I noticed some NULL values in some generated reports.

I opened cqlsh and, after a few queries, saw that NULL values are appearing all over the data, apparently in random columns.

Replication factor is 3.

I've started a nodetool repair on the cluster but it hasn't finished yet.

I searched for this behavior and could not find it described anywhere, so the random appearance of NULL values in columns is apparently not a common problem.

Does anyone know what's going on? This kind of data corruption seems pretty serious. Thanks in advance for any ideas.

ADDED Details:

  • Happens on columns that are frequently updated with toTimestamp(now()), which never returns NULL, so it's not about null data going in.

  • Happens on immutable columns that are only inserted once and never changed. (But other columns on the table are frequently updated.)

Do updates cause this like deletions do? Seems kinda serious to me, to wake up to a bunch of NULL values.

I also know specifically which data has been lost: I've already identified three important entries that are missing. These have not been deleted, for sure. One specific table that is now full of NULLs never has any deletions run against it.

I am the sole admin and nobody ran any nodetool commands overnight, 100% sure.

UPDATE

nodetool repair has been running for 6+ hours now and it fully recovered the data on one varchar column "item description".

It's a Cassandra issue and no, there were no deletions at all. And like I said, columns written by functions that never return null (toTimestamp(now())) had NULLs in them.

UPDATE 2

So nodetool repair finished overnight but the NULLs were still there in the morning.

So I went node by node stopping and restarting them and voilà, the NULLs are gone and there was no data loss.

This is a major league bug if you ask me. I don't have the resources to go after it right now, but if anyone else faces this, here's the simple "fix":

  1. Run nodetool repair -dcpar to fix all nodes in the datacenter.
  2. Restart node by node.
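
For anyone who wants to script the two steps above, here is a minimal sketch that only builds the ordered list of commands and does not run anything. The host names, the use of ssh, and the `sudo systemctl restart cassandra` restart command are assumptions that depend on how Cassandra was installed. The `nodetool drain` before each restart and the `nodetool status` after it are additions of mine, not part of the original steps.

```python
from typing import List


def rolling_restart_plan(
    hosts: List[str],
    restart_cmd: str = "sudo systemctl restart cassandra",  # assumption: systemd install
) -> List[str]:
    """Return the shell commands, in order, for a repair followed by
    a one-node-at-a-time restart. Nothing is executed here."""
    # Step 1: repair, run once, with datacenters repaired in parallel.
    plan = ["nodetool repair -dcpar"]
    # Step 2: restart node by node, never more than one node at a time.
    for host in hosts:
        # Flush memtables and stop accepting writes before the restart (my addition).
        plan.append(f"nodetool -h {host} drain")
        plan.append(f"ssh {host} '{restart_cmd}'")
        # Check by hand that the node is back up before moving to the next one.
        plan.append(f"nodetool -h {host} status")
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for cmd in rolling_restart_plan(["node1", "node2", "node3"]):
        print(cmd)
```

Printing the plan first and running each command by hand keeps a human in the loop, which seems safer on a cluster that is already misbehaving.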

Solution

  • I faced a similar issue some months ago and it's explained quite well in the following blog (this is not written by me): WAT - Cassandra: Row level consistency #$@&%*!

    In this case, the null values were actually caused by updates.
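
To make the idea that an update can produce a NULL more concrete, here is a small simulation of how Cassandra reconciles a row cell by cell, as I understand it. It is not taken from the blog and is not a reproduction of the bug in the question. The names `reconcile_cell`, `merge_rows` and `visible` are hypothetical, and the model is simplified: each cell carries a write timestamp, the higher timestamp wins, and on a tie a tombstone (null) beats a live value.

```python
from typing import Dict, Optional, Tuple

# A cell is (write_timestamp_micros, value); a value of None models a tombstone.
Cell = Tuple[int, Optional[str]]
Row = Dict[str, Cell]


def reconcile_cell(a: Cell, b: Cell) -> Cell:
    """Pick the winning version of one cell (simplified last-write-wins)."""
    if a[0] != b[0]:
        return a if a[0] > b[0] else b  # higher timestamp wins
    if a[1] is None or b[1] is None:
        return a if a[1] is None else b  # tie: the tombstone wins
    return a if a[1] >= b[1] else b  # tie between live cells: greater value wins


def merge_rows(x: Row, y: Row) -> Row:
    """Merge two versions of a row column by column, not as a whole row."""
    merged = dict(x)
    for col, cell in y.items():
        merged[col] = reconcile_cell(merged[col], cell) if col in merged else cell
    return merged


def visible(row: Row) -> Dict[str, Optional[str]]:
    """Strip timestamps to show what a reader would see."""
    return {col: cell[1] for col, cell in row.items()}


if __name__ == "__main__":
    insert = {"description": (1000, "widget"), "updated_at": (1000, "2017-01-01")}
    # An update that binds null to one column writes a tombstone for that column only.
    update = {"description": (2000, None), "updated_at": (2000, "2017-01-02")}
    print(visible(merge_rows(insert, update)))
    # prints {'description': None, 'updated_at': '2017-01-02'}
```

The point of the sketch is that a row is never merged as a unit: each column is resolved on its own, so a single write can leave one column NULL while the rest of the row looks healthy.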