I use both Cassandra and ScyllaDB 3-node clusters and use PySpark to read data. I was wondering if any of them are not repaired forever, is there any challenge while reading data from either if there are inconsistencies in nodes. Will the correct data be read and if yes, then why do we need to repair them?
Yes you can get incorrect data if reapir is not done. It also depends on with what consistency you are reading or writing. Generally in production systems writes are done with (Local_one/Local_quorum) and read with Local_quorum.
If you are writing with weak consistency level, then repair becomes important as some of the nodes might not have got the mutations and while reading those nodes may get selected.
For example if you write with consistency level ONE
on a table TABLE1
with a replication of 3. Now it may happen your write was written to NodeA
only and NodeB
and NodeC
might have missed the mutation. Now if you are reading with Consistency level LOCAL_QUORUM
, it may happen that NodeB
and 'NodeC' get selected and they do not return the written data.
Repair is an important maintenance task for Cassandra which should be done periodically and continuously to keep data in healthy state.