cassandra cassandra-2.1

Cancelling ongoing compaction jobs in Cassandra


I have a 3-node cluster. 2 out of 3 nodes show 100% CPU usage.

It seems we did not run repair and cleanup after changing the consistency level (or we ran them too late, or they didn't complete).

Now we have more than 100k compaction tasks pending, and they eat 100% CPU.

I tried the following:

nodetool stop -- COMPACTION
nodetool stop -- INDEX_BUILD
nodetool stop -- VALIDATION
nodetool stop -- CLEANUP
nodetool stop -- SCRUB

No change, and no error either.

The only message I got was:

No files to compact for user defined compaction 

What is the issue? How can I cancel ongoing jobs?


Solution

  • Calling nodetool stop COMPACTION stops the compactions that are currently running. If you don't want new compactions to start, use nodetool disableautocompaction. You can then verify with nodetool compactionstats.

    I am certain that this is not your real problem, however. With 100k pending compactions you will have far too many sstables; your node is hopelessly behind. Any reads at all will cause massive load. Also, unless you have a huge heap, just trying to read from them will likely leave you low on heap space and cause GC issues. The GCs are likely the cause of your high load. Check where your CPU time goes: if it's being spent in IO, it's likely from reads or streaming; if it's in sys/usr, it's probably GCs. If it is a GC issue, you can take a heap dump to verify what's taking all the space.

    With 100k behind, your node will probably never recover on its own. Your best bet will probably be one of:

    • Replace it, or even have it replace itself.
    • Remove it from the cluster with nodetool disablebinary/disablethrift/disablegossip, then use nodetool compact to force-compact all sstables. Depending on version and compaction strategy it may not work, but you can use JMX to change the compaction strategy to STCS locally, for that node only, to make it work. If this can't be completed within the hinted handoff window, it's not worth the trouble of trying to make your cluster consistent again. Also, this will only work if the load goes down when the node is removed from the cluster.
    • Set up monitoring and alerting, and never let it get that far behind again. Target fewer than 100 pending compactions.
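
The commands mentioned in this answer could be strung together roughly as follows. This is only a sketch, not a tested recovery procedure: the `run` wrapper, the `APPLY` variable and the three function names are made up for illustration, and the re-enable step at the end is an assumption (the answer only describes taking the node out). By default the script prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry-run by default: each command is printed, not executed.
# Set APPLY=1 (hypothetical switch) to really run them on the affected node.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Stop running compactions, keep new ones from starting, then verify.
pause_compactions() {
    run nodetool stop COMPACTION
    run nodetool disableautocompaction
    run nodetool compactionstats
}

# Cut the node off from clients and from the cluster, then force a
# major compaction of all sstables. This has to finish within the
# hinted handoff window to be worth it.
offline_major_compaction() {
    run nodetool disablebinary
    run nodetool disablethrift
    run nodetool disablegossip
    run nodetool compact
}

# Assumption: once compaction is done, undo the steps in reverse order.
rejoin_cluster() {
    run nodetool enableautocompaction
    run nodetool enablegossip
    run nodetool enablethrift
    run nodetool enablebinary
}

# Print (or, with APPLY=1, execute) the whole plan.
pause_compactions
offline_major_compaction
rejoin_cluster
```

Run it once without APPLY to review the plan, and watch nodetool compactionstats while the forced compaction is running. If the load does not drop after the node is cut off, this approach will not help and replacing the node is the safer route.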