
Cassandra 3.0.5 nodes fail to start up with "IllegalStateException: One row required, 2 found"


I have run into a horrible situation on one of my Cassandra clusters. The cluster is on version 3.0.5. I am running a 2 DC setup with close to 30 nodes, 18 in one DC and the rest in the other. I have done everything I know how to do, but I am still looking for answers.

Lately we had been having issues with GC pauses, so some JVM tuning was done on all nodes (MAX_HEAP_SIZE was changed), and the cluster was ready for a rolling restart so the change could take effect.

The 1st node went through the rolling restart fine, but the 2nd node did not come back up after the shutdown. It failed with the error below.

INFO  07:45:34 Initializing system_schema.keyspaces
INFO  07:45:34 Initializing system_schema.tables
INFO  07:45:34 Initializing system_schema.columns
INFO  07:45:34 Initializing system_schema.triggers
INFO  07:45:34 Initializing system_schema.dropped_columns
INFO  07:45:34 Initializing system_schema.views
INFO  07:45:34 Initializing system_schema.types
INFO  07:45:34 Initializing system_schema.functions
INFO  07:45:34 Initializing system_schema.aggregates
INFO  07:45:34 Initializing system_schema.indexes
Exception (java.lang.IllegalStateException) encountered during startup: One row required, 2 found
java.lang.IllegalStateException: One row required, 2 found
    at org.apache.cassandra.cql3.UntypedResultSet$FromResultSet.one(UntypedResultSet.java:84)
    at org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:948)
    at org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:938)
    at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:901)
    at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:878)
    at org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:866)
    at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:134)
    at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:124)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:229)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:551)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:679)

After running repairs on the cluster, and specifically on the system keyspaces, the error still persisted. When the node still did not come up, I removed it from the cluster using the nodetool removenode command from a healthy node.

Another node in the same cluster and datacenter was then restarted, and it also did not come back up, failing with the same error.

I was also unable to log in to cqlsh from a healthy node. It failed with the error below.

Connection error: ('Unable to connect to any servers', {'<<VM hostname>>': UnicodeDecodeError('utf8', '\x7f\x00\x00\x80C\x02', 3, 4, 'invalid start byte')})

The following error was also seen on a few other nodes.

Connection error: ('Unable to connect to any servers', {'<<VM Hostname>>': ConnectionShutdown("'utf8' codec can't decode byte 0x80 in position 3: invalid start byte",)})

Essentially, nothing in the cluster is working except the nodetool commands.

When I ran nodetool describecluster, I saw 5 different schema versions across the nodes and 9 nodes reported as unreachable. The output is below.

./nodetool describecluster
Cluster Information:
    Name: Dummy cluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        1590ea6a-8c19-342a-8269-204c64a12176: [9 nodes here]
        668d9efd-13c1-3fb3-9b89-7fc07d9ddf0b: [1 node here]
        d20dc0de-dd34-3183-b459-31e3feb8f118: [3 nodes here]
        3ec9610c-d241-3215-84f2-2413b8cad7d2: [7 nodes here]
        59adb24e-f3cd-3e02-97f0-5b395827453f: [1 node here]
        UNREACHABLE: [9 nodes unreachable]

Can someone please help me understand what the issue could be, and how to bring the nodes back up? I also tried the ignore schema mismatch flag in cassandra-env.sh/jvm.options to bring the node up, but that did not help either.


Solution

  • You may very well be affected by the same issue reported in CASSANDRA-11900, which was ultimately fixed by CASSANDRA-12144. You could try bringing down the nodes and moving to a more stable release of Cassandra 3.0. 3.0.28 is the latest available: https://www.apache.org/dyn/closer.lua/cassandra/3.0.28/apache-cassandra-3.0.28-bin.tar.gz

    Because the stack in the log shows there is trouble migrating the system_schema tables, if you can't start Cassandra with the newer version, then you could try sstablescrub in the newer version before starting.

    With this many nodes unreachable, hopefully a newer version can get you past this, and you can then rolling-restart the cluster, starting with the seed nodes, to fix the schema version. Otherwise, you'll need to save the 'nodetool ring' output from an online node, strip out the initial token values for the nodes that are on non-matching schema versions, bootstrap those nodes back into the cluster, populate the user-created tables with 'nodetool refresh', and, once everything is back, repair the cluster. Those steps can be laid out if it comes to it, but hopefully an upgrade can get you past this.
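
If it does come to the fallback path, the bookkeeping step (working out which nodes are on non-matching schema versions and which tokens they own) can be scripted from saved nodetool output. The following is only a rough sketch of one way to read that step, not a tested recovery procedure. The function names and the IP addresses in the demo are hypothetical, and it assumes the usual text layout of 'nodetool describecluster' and 'nodetool ring', where the address is the first column and the token is the last. It also assumes the most common schema version is the one to keep, which may not be true for your cluster.

```python
import ipaddress
import re
from collections import defaultdict

# Matches "<schema-uuid>: [addr, addr, ...]" or "UNREACHABLE: [...]" lines
# as printed by `nodetool describecluster`.
SCHEMA_LINE = re.compile(r"^\s*([0-9a-f-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$")


def schema_versions(describecluster_text):
    """Map each schema version (or 'UNREACHABLE') to its list of node addresses."""
    versions = {}
    for line in describecluster_text.splitlines():
        match = SCHEMA_LINE.match(line)
        if match:
            nodes = [n.strip() for n in match.group(2).split(",") if n.strip()]
            versions[match.group(1)] = nodes
    return versions


def nodes_off_majority(versions):
    """Return reachable nodes whose schema version is not the most common one.

    Assumption: the version held by the most nodes is the "good" one.
    On a tie, the version listed first wins.
    """
    reachable = {v: n for v, n in versions.items() if v != "UNREACHABLE"}
    majority = max(reachable, key=lambda v: len(reachable[v]))
    return sorted(n for v, nodes in reachable.items() if v != majority for n in nodes)


def tokens_by_node(ring_text):
    """Collect the tokens listed for each address in saved `nodetool ring` output."""
    tokens = defaultdict(list)
    for line in ring_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines, separators, and the lone token row under the header
        try:
            ipaddress.ip_address(fields[0])  # first column must be an address
            token = int(fields[-1])          # last column must be a token
        except ValueError:
            continue  # header rows, "Datacenter:" lines, warnings
        tokens[fields[0]].append(token)
    return dict(tokens)


if __name__ == "__main__":
    # Hypothetical, heavily shortened sample output for illustration only.
    describe = """Schema versions:
        1590ea6a-8c19-342a-8269-204c64a12176: [10.0.0.1, 10.0.0.2]
        668d9efd-13c1-3fb3-9b89-7fc07d9ddf0b: [10.0.0.3]
        UNREACHABLE: [10.0.0.4]
    """
    ring = """Address    Rack   Status State   Load      Owns   Token
    10.0.0.1   r1     Up     Normal  1.2 GiB   ?      -9000
    10.0.0.3   r1     Up     Normal  1.1 GiB   ?      100
    """
    owned = tokens_by_node(ring)
    for node in nodes_off_majority(schema_versions(describe)):
        print(node, ",".join(str(t) for t in owned.get(node, [])))
```

The comma-separated list printed for each node is in the form that the initial_token setting in cassandra.yaml accepts. Whether re-using those tokens is the right move depends on the actual recovery plan, so treat the output as notes to keep alongside the saved 'nodetool ring' file rather than something to apply blindly.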