Cassandra AssertionError

I received an OOM exception at one point in Cassandra. Mine is a single instance running on a modestly powered server, and i was doing some load testing, so no surprise there.

But, i have subsequently been unable to use the instance. When i list the keyspaces, only "system" is shown. But when i try to recreate the keyspace i was testing with, Hector responds with the dreaded "All host pools marked down. Retry burden pushed out to client." message, and the Cassandra log has the following stack trace:

ERROR [MigrationStage:1] 2012-04-27 20:47:00,863 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[MigrationStage:1,5,main]
java.lang.AssertionError
    at org.apache.cassandra.db.DefsTable.updateKeyspace(DefsTable.java:441)
    at org.apache.cassandra.db.DefsTable.mergeKeyspaces(DefsTable.java:339)
    at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:269)
    at org.apache.cassandra.service.MigrationManager$1.call(MigrationManager.java:214)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
ERROR [Thrift:9] 2012-04-27 20:47:00,864 CustomTThreadPoolServer.java (line 204) Error occurred during processing of message.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError
    at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:372)
    at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:191)
    at org.apache.cassandra.service.MigrationManager.announceNewKeyspace(MigrationManager.java:129)
    at org.apache.cassandra.thrift.CassandraServer.system_add_keyspace(CassandraServer.java:987)
    at org.apache.cassandra.thrift.Cassandra$Processor$system_add_keyspace.getResult(Cassandra.java:3370)
    at org.apache.cassandra.thrift.Cassandra$Processor$system_add_keyspace.getResult(Cassandra.java:3358)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:186)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.AssertionError
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:368)
    ... 11 more
Caused by: java.lang.AssertionError
    at org.apache.cassandra.db.DefsTable.updateKeyspace(DefsTable.java:441)
    at org.apache.cassandra.db.DefsTable.mergeKeyspaces(DefsTable.java:339)
    at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:269)
    at org.apache.cassandra.service.MigrationManager$1.call(MigrationManager.java:214)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    ... 3 more

The old keyspace was still in the data dir, so i moved it, but that didn't help. It seems that the system data still has an invalid reference somewhere. Does anyone know how to fix this?

Edit: from the CLI, a "describe cluster;" only describes the "system" keyspace. But when i "use system;" and then "list schema_keyspaces;" the following is displayed:

Using default limit of 100
-------------------
RowKey: mango
=> (column=durable_writes, value=true, timestamp=29127788177516974)
=> (column=name, value=mango, timestamp=29127788177516974)
=> (column=strategy_class, value=org.apache.cassandra.locator.SimpleStrategy, timestamp=29127788177516974)
=> (column=strategy_options, value={"replication_factor":"1"}, timestamp=29127788177516974)

1 Row Returned.
Elapsed time: 1107 msec(s).

"mango" is the keyspace that i can no longer access, but it is still in there to some degree. Is there any way to fix it?

Solution

This problem is due to inconsistency and you can go for following steps.

1) In your case this is OK to clear "data" , "saved_caches" and "commitlog" directories as you dont have any critical data and other Keyspaces.

2) In the scenarios where you have some critical data and you can not delete above mentioned directories do the following.

Use nodetool drain to Empty the commitlog on all the nodes of the cluster.
Then delete all the "LocationInfo*" files from "/data/system" directories and restart the cluster.