
Cassandra Repair fails


Cassandra repair fails on node 1 with the error below. I had earlier started multiple repair sessions in parallel by mistake. I found a bug report, https://issues.apache.org/jira/browse/CASSANDRA-11824, which covers the same scenario and has been resolved, but I am already running Cassandra 3.9. Can you confirm whether running nodetool scrub is the only workaround? Are there any considerations to keep in mind before running scrub, as I would need to run it directly on Prod?

com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #6546ce10-3a70-11ec-9336-394ae1cd743d on test/test_config, [(-1879129450237588992,-1867793788349541955], (-1228457230064908637,-1228389616821781301], (583169750278890460,583583127041100026]]] Validation failed in /10.11.22.123
        at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]

On node 2 (10.11.22.123):

ERROR 17:33:12 Cannot start multiple repair sessions over the same sstables
ERROR 17:33:12 Failed creating a merkle tree for [repair #6546ce10-3a70-11ec-9336-394ae1cd743d on test/test_config, [(-1879129450237588992,-1867793788349541955], (-1228457230064908637,-1228389616821781301], (583169750278890460,583583127041100026]]], /10.11.22.789(node 1) (see log for details)
ERROR 17:33:12 Exception in thread Thread[ValidationExecutor:10,1,main]
java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
        at org.apache.cassandra.service.ActiveRepairService$ParentRepairSession.markSSTablesRepairing(ActiveRepairService.java:526) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1318) ~[apache-cassandra-3.9.jar:3.9]

Solution

  • nodetool tpstats revealed that there were indeed active repair jobs, but they were not actually making progress, and nodetool compactionstats showed no running validation or compaction jobs either. Restarting just the nodes on which the repair was stuck cleared those stuck repair sessions, and a fresh repair ran successfully afterwards.

    nodetool tpstats    
    Pool Name                    Active   Pending      Completed   Blocked  All time blocked
    MutationStage                     0         0      323161614         0                 0
    ViewMutationStage                 0         0              0         0                 0
    ReadStage                         0         0      339671804         0                 0
    RequestResponseStage              0         0      440712393         0                 0
    ReadRepairStage                   0         0       13751257         0                 0
    CounterMutationStage              0         0              0         0                 0
    Repair#3                          1      3525              3         0                 0
    .....
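The check above can be scripted: a small helper that scans `nodetool tpstats` output for `Repair#` thread pools with a non-zero Pending count, which are the candidates for stuck sessions. This is a sketch of my workflow, not an official tool; the function name `detect_stuck_repairs` is my own, and the restart commands in the comments assume a systemd-managed Cassandra service, so adapt them to your environment.

```shell
#!/bin/sh
# Flag repair thread pools that report pending work. In tpstats output the
# columns are: Pool Name, Active, Pending, Completed, Blocked, All time blocked,
# so $1 is the pool name, $2 is Active, and $3 is Pending.
detect_stuck_repairs() {
    awk '$1 ~ /^Repair#/ && $3 > 0 { print $1, "active=" $2, "pending=" $3 }'
}

# Typical usage (assumes nodetool is on PATH):
#   nodetool tpstats | detect_stuck_repairs
#
# If a pool shows up here but compactionstats reports no running validations,
# the session is likely stuck. What worked for me was a clean restart of just
# the affected node, e.g. (systemd service name is an assumption):
#   nodetool drain && sudo systemctl restart cassandra
# and then re-running the repair once the node is back up.
```

Restarting only the stuck nodes, rather than the whole cluster, kept the impact on Prod minimal; `nodetool drain` flushes memtables and stops accepting writes so the restart is clean.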