cassandra datastax-enterprise datastax-startup

Cassandra nodes cannot communicate with each other, causing ReadTimeout

This is on DataStax Enterprise (DSE) version 4.8.5-1.
This corresponds (I believe) to Cassandra 2.1.x.

I'm getting a lot of the following errors when querying from our application:

ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'data_retrieved': False, 'required_responses': 1, 'consistency': 1}

Digging into this more: a sample query (run using cqlsh locally on each node) returns on 3 of the nodes in the ring but fails with a ReadTimeout on the rest. It seems that only the nodes holding the replicas return a response, while the rest cannot reach them at all.

Is there some configuration or known issue I should be looking at to fix this?

When the other nodes fail, I see this error in the logs:

ERROR [MessagingService-Outgoing-/10.0.10.14] 2016-04-25 20:46:46,818  CassandraDaemon.java:229 - Exception in thread Thread[MessagingService-Outgoing-/10.0.10.14,5,main]
java.lang.AssertionError: 371205
        at org.apache.cassandra.utils.ByteBufferUtil.writeWithShortLength(ByteBufferUtil.java:290) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:393) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:382) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:271) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:259) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:503) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:490) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.SliceFromReadCommandSerializer.serialize(SliceFromReadCommand.java:168) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:143) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:132) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:121) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:330) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:282) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]
        at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:218) ~[cassandra-all-2.1.13.1131.jar:2.1.13.1131]

Nodetool status output

Datacenter: primary
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
UN  10.0.10.224  557.95 GB  1       ?       d1b984b0-50d4-4faa-b349-08bc0cf36447  RAC1
UN  10.0.10.225  740.11 GB  1       ?       16ab3c8c-476e-46c2-837c-6dbb89b7d40d  RAC1
UN  10.0.10.12   748.23 GB  1       ?       4127f0d7-6bd0-4dc8-b6a0-3b261e55b44e  RAC1
UN  10.0.10.45   629.27 GB  1       ?       f4499c5d-f892-43b8-97f3-dcce5be51fb8  RAC2
UN  10.0.10.13   592.57 GB  1       ?       41b58044-942d-4e77-a8de-95495b88a073  RAC1
UN  10.0.10.14   616.45 GB  1       ?       d2b568fb-13e1-4ff7-a247-3751a8ca49cf  RAC1
UN  10.0.10.15   623.23 GB  1       ?       fb10e521-8359-409b-bfd8-b27829157a80  RAC1
UN  10.0.10.21   538.56 GB  1       ?       72288b4c-bd1d-4398-9d95-5af312c2f904  RAC2
UN  10.0.10.25   616.63 GB  1       ?       4a8f04ff-a198-44d1-baf4-72cc430cd8a9  RAC2
UN  10.0.10.218  562.98 GB  1       ?       c00c375d-90bb-48c5-a8d0-7102a13db468  RAC2
UN  10.0.10.219  632.58 GB  1       ?       1e2ea144-35bd-412b-89b5-41544a347a75  RAC2
UN  10.0.10.220  746.85 GB  1       ?       d40f59c1-430a-4d96-9d7e-1e846b8eb1fc  RAC2
UN  10.0.10.221  575.89 GB  1       ?       7e407d6b-2bd5-43b4-9116-96ee72a926b2  RAC2
UN  10.0.10.222  639.98 GB  1       ?       bfd04ab8-7679-4474-8d47-984950bdd2c7  RAC1
UN  10.0.10.223  652.58 GB  1       ?       6366cd3e-7910-40bb-8a12-926c53adf95b  RAC1

The code for this assertion is here:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.cassandra/cassandra-all/2.1.1/org/apache/cassandra/utils/ByteBufferUtil.java?av=f#290

There's no obvious schema mismatch when looking at either the system.local or system.peers tables.
nodetool describecluster returns UNREACHABLE from some nodes.

Solution

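You are probably hitting the 64K max key size limit: http://wiki.apache.org/cassandra/FAQ#max_key_size

The number in the AssertionError (371205) is the length that writeWithShortLength refused to write. A two-byte length prefix can describe at most 65535 bytes.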

Check your application code: something is probably sending Cassandra a 371205-byte value as part of a primary key. It could even be someone trying to attack your application, because a primary key of roughly 370 KB is very unlikely to be legitimate. Restrict the key size in your application code.

I don't know whether a bug report, fix, or workaround exists for this.
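
One way to enforce such a restriction is a small guard that runs before a query is sent. The following is only a minimal sketch, assuming the key components are text or blob values and that the limit is the 65535 bytes a two-byte length prefix can hold. The names check_key_components, encoded_length and MAX_KEY_COMPONENT_BYTES are made up for illustration and are not part of any driver API.

```python
# Hypothetical application-side guard; none of these names come from a driver.
MAX_KEY_COMPONENT_BYTES = 65535  # largest length an unsigned 16-bit prefix can hold


def encoded_length(value):
    """Return the byte length of a str (as UTF-8) or a bytes-like value."""
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, (bytes, bytearray, memoryview)):
        return len(bytes(value))
    raise TypeError("unsupported key component type: %s" % type(value).__name__)


def check_key_components(components):
    """Raise ValueError if any primary key component exceeds the limit.

    `components` maps a column name to the value about to be bound to it.
    """
    for name, value in components.items():
        size = encoded_length(value)
        if size > MAX_KEY_COMPONENT_BYTES:
            raise ValueError(
                "key component %r is %d bytes; the limit is %d"
                % (name, size, MAX_KEY_COMPONENT_BYTES)
            )
```

Calling check_key_components({"id": user_supplied_id}) before executing the statement rejects the oversized value in the application, instead of letting it reach the coordinator and fail during inter-node serialization.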