Tags: java, performance, optimization, cassandra, deadlock

Why is my Cassandra node stuck with MutationStage increasing?


I'm using Cassandra to store pictures. We are currently mass-migrating pictures from an old system. Everything works great for a while, but eventually we get a TimedOutException when saving, which I assume is because the work queue is full.

However, after waiting several hours for it to finish, the situation remains the same (it doesn't recover by itself after we stop the migration).

There seems to be a problem with only one node, whose nodetool tpstats command shows the following data:

[screenshot: Cassandra nodetool tpstats output]

The pending MutationStage operations keep increasing even though we stopped the inserts hours ago.

What exactly does that mean? What is the MutationStage?

What can I check to see why it isn't stabilising after so long? All the other servers in the ring are at 0 pending operations.

Any new insert we attempt throws a TimedOutException.

This is the ring information, in case it's useful:

[screenshot: nodetool ring output]
(the node with issues is the first one)

EDIT: The last lines in the log are as follows

INFO [OptionalTasks:1] 2013-02-05 10:12:59,140 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 92972117 bytes)  
INFO [OptionalTasks:1] 2013-02-05 10:12:59,141 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@916497516(74377694/92972117 serialized/live bytes, 141 ops)
INFO [OptionalTasks:1] 2013-02-05 10:14:49,205 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 80689206 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:14:49,207 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@800272493(64551365/80689206 serialized/live bytes, 113 ops)
WARN [MemoryMeter:1] 2013-02-05 10:16:10,662 Memtable.java (line 197) setting live ratio to minimum of 1.0 instead of 0.0015255633589225548
INFO [MemoryMeter:1] 2013-02-05 10:16:10,663 Memtable.java (line 213) CFS(Keyspace='pics_persistent', ColumnFamily='master') liveRatio is 1.0 (just-counted was 1.0).  calculation took 38ms for 86 columns
INFO [OptionalTasks:1] 2013-02-05 10:16:33,267 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 71029403 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:16:33,269 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@143498560(56823523/71029403 serialized/live bytes, 108 ops)
INFO [ScheduledTasks:1] 2013-02-05 11:36:27,798 GCInspector.java (line 122) GC for ParNew: 243 ms for 1 collections, 1917768456 used; max is 3107979264
INFO [ScheduledTasks:1] 2013-02-05 13:00:54,090 GCInspector.java (line 122) GC for ParNew: 327 ms for 1 collections, 1966976760 used; max is 3107979264

Solution

  • I guess you're just overloading one of your nodes with writes, i.e. you write faster than it is able to digest them. That's easy to do if your writes are huge.

    The MutationStage count keeps increasing even after you stopped writing to the cluster because the other nodes are still processing their queued mutation requests and sending replicas to this overloaded node (the JMX sketch after the list below shows one way to watch whether that backlog is actually draining).

    I don't know why this particular node got overloaded; there may be several reasons:

    • the node is slower than the others (different hardware or different configuration)
    • the cluster is not properly balanced (however, the beginning of your nodetool ring output suggests this is not the case)
    • you're directing all your writes to this particular node instead of distributing them across all nodes equally, e.g. by round-robin (see the throttled round-robin client sketch after this list)
    • you configured a total memtable size limit and/or cache sizes that are too big for your total heap space, so your nodes are struggling with GC, and this one just happened to be the first to fall into the GC death spiral
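
    To see whether the backlog is draining or still growing, you can poll the same counter that nodetool tpstats displays over JMX and log it over time. This is only a minimal sketch: the host name is a placeholder, and the MBean/attribute names are what I recall 1.x-era Cassandra exposing for its thread pools, so verify them with jconsole against your exact version.

        import javax.management.MBeanServerConnection;
        import javax.management.ObjectName;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class MutationStageWatcher {

            public static void main(String[] args) throws Exception {
                // Host name is a placeholder; 7199 is Cassandra's default JMX port.
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi");
                JMXConnector connector = JMXConnectorFactory.connect(url);
                try {
                    MBeanServerConnection mbeans = connector.getMBeanServerConnection();
                    // MBean and attribute names as exposed by 1.x-era Cassandra thread pools
                    // (the same data nodetool tpstats reads); check them for your version.
                    ObjectName mutationStage =
                            new ObjectName("org.apache.cassandra.request:type=MutationStage");
                    for (int i = 0; i < 60; i++) { // sample once a minute for an hour
                        Object active = mbeans.getAttribute(mutationStage, "ActiveCount");
                        Object pending = mbeans.getAttribute(mutationStage, "PendingTasks");
                        System.out.println("MutationStage active=" + active + " pending=" + pending);
                        Thread.sleep(60000);
                    }
                } finally {
                    connector.close();
                }
            }
        }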
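
    If the migration client turns out to be hammering a single coordinator, spreading the writes and throttling them usually lets the node catch up. Below is a minimal sketch assuming you can use the CQL-based DataStax Java driver and Guava's RateLimiter; the contact points, keyspace/table layout and 50 writes/s rate are placeholders to adapt to your setup. If you're on a Thrift client such as Hector, apply the same ideas through its host-pool/load-balancing settings instead.

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.PreparedStatement;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.policies.RoundRobinPolicy;
        import com.google.common.util.concurrent.RateLimiter;

        import java.nio.ByteBuffer;
        import java.util.Collections;
        import java.util.List;
        import java.util.UUID;

        public class ThrottledPictureMigrator {

            public static void main(String[] args) {
                // Contact points are placeholders; listing several nodes lets the driver
                // spread coordinator duty instead of pinning every write to one node.
                Cluster cluster = Cluster.builder()
                        .addContactPoints("10.0.0.1", "10.0.0.2", "10.0.0.3")
                        .withLoadBalancingPolicy(new RoundRobinPolicy()) // round-robin over the ring
                        .build();
                Session session = cluster.connect("pics_persistent");

                // Hypothetical CQL table standing in for the "master" column family.
                PreparedStatement insert =
                        session.prepare("INSERT INTO master (pic_id, data) VALUES (?, ?)");

                // Throttle the migration so no node is fed writes faster than it can
                // flush memtables; 50 writes/s is a placeholder to tune against tpstats.
                RateLimiter limiter = RateLimiter.create(50.0);

                List<byte[]> pictures = Collections.emptyList(); // replace with the real migration source

                for (byte[] picture : pictures) {
                    limiter.acquire(); // blocks until the limiter hands out a permit
                    session.execute(insert.bind(UUID.randomUUID(), ByteBuffer.wrap(picture)));
                }

                cluster.close(); // older 1.x drivers use shutdown() instead
            }
        }

    Even with the client spreading requests, keep an eye on nodetool tpstats while ramping the rate up; the right ceiling is whatever keeps pending MutationStage near zero on every node.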