
Cassandra repair FAILED without ERROR log


I tried to repair a table on Cassandra 4.0 with the following command:

nodetool repair -tr -inc -st -1013347141143265728 -et -1000918482763387932 keyspace table_name

The console output then stopped at the lines below:

[2024-06-19 10:37:07,190] /10.0.40.8: Adding to parent_repair_history memtable
[2024-06-19 10:37:07,190] /10.0.40.8: Enqueuing WRITES.WRITE response to /10.0.40.8
[2024-06-19 10:37:07,190] /10.0.40.8: Sending WRITES.WRITE message to /10.0.40.14, size=26 bytes

One day later I checked the status with repair_admin, which showed the state FAILED, and the console started printing:

[After waiting for poll interval of 300 seconds] queried for parent session status and 2024-06-20 10:35:00,143 couldn't find repair status for cmd: 5
[After waiting for poll interval of 300 seconds] queried for parent session status and 2024-06-20 10:40:00,145 couldn't find repair status for cmd: 5
[After waiting for poll interval of 300 seconds] queried for parent session status and 2024-06-20 10:45:00,146 couldn't find repair status for cmd: 5

Next I checked system.log and debug.log (grepping for the repair UUID), but there was no ERROR entry, and the DEBUG and INFO entries looked normal. I even found:
INFO  [AntiCompactionExecutor:24] 2024-06-19 10:35:07,367  CompactionManager.java:739 - [repair #85e6e5f0-2de4-11ef-b649-2b6780b7b939] Completed anticompaction successfully

The only WARN entry is this one:

WARN  [OptionalTasks:1] 2024-06-20 10:42:53,804  LocalSessions.java:273 - Auto failing timed out repair session LocalSession{sessionID=85e6e5f0-2de4-11ef-b649-2b6780b7b939, state=PREPARED, coordinator=/10.0.40.14, tableIds=[5320ff80-8d59-11ec-a59e-7309d471ee5e], repairedAt=1718764499422, ranges=[(-1013347141143265728,-1000918482763387932]], participants=[/10.0.40.1, /10.0.40.14, /10.0.40.10], startedAt=1718764499, lastUpdate=1718764507}

Am I tracing the problem the wrong way? How can I debug a repair that ends up FAILED?


Solution

  • It is incorrect to assume that the repair failed just because nodetool repair_admin was unable to get the status. As the message indicates:

    ... queried for parent session status and ... couldn't find repair status for cmd ...
    

    it really is as simple as that: the repair status could not be found. If the repair had failed, repair_admin would clearly report "... repair failed".

    It is not unusual for repair_admin to be unable to determine the status of a repair. This can happen for various reasons, such as nodes being busy at the time, so you will have to investigate further.

    You can manually troubleshoot the repair session using the session ID. For instance, you can step through the Cassandra log messages that contain the session ID 85e6e5f0-2de4-11ef-b649-2b6780b7b939. You'll need to correlate the log messages on the repair coordinator node (coordinator=/10.0.40.14) with those on the replicas involved in the repair (participants=[/10.0.40.1, /10.0.40.14, /10.0.40.10]) to get an idea of what happened. Cheers!
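The correlation step described above can be sketched in a few lines of Python. This is only a rough illustration, not a Cassandra tool: the function name session_timeline and the node-to-lines mapping are hypothetical, the regular expression assumes the log layout shown in the question (level, thread, timestamp, source file, message), and the merged ordering is only meaningful if the node clocks are reasonably in sync.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log lines quoted in the question, e.g.
# "WARN  [OptionalTasks:1] 2024-06-20 10:42:53,804  LocalSessions.java:273 - message"
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<source>\S+)\s+-\s+(?P<message>.*)$"
)


def session_timeline(logs_by_node, session_id):
    """Merge the log lines that mention session_id into one ordered timeline.

    logs_by_node is a hypothetical mapping of node address to an iterable of
    log lines (for example, the lines of that node's system.log).
    Returns a list of (timestamp, node, level, message) tuples sorted by time.
    """
    events = []
    for node, lines in logs_by_node.items():
        for line in lines:
            if session_id not in line:
                continue  # unrelated to this repair session
            match = LOG_LINE.match(line.strip())
            if match is None:
                continue  # continuation line or a different log layout
            timestamp = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
            events.append((timestamp, node, match["level"], match["message"]))
    # Sort on the timestamp only, so lines with equal times keep their input order.
    events.sort(key=lambda event: event[0])
    return events
```

To use it, read system.log and debug.log from the coordinator and from each participant into the mapping, call session_timeline with the session ID, and print the tuples in order. Gaps between consecutive events, or a node that never logs a given phase of the session, are the places to look more closely.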