
Some Cassandra nodes show DN - gossip information is not unanimous


I have a 16-node Cassandra cluster (3.11.9) with 3 seed nodes (.54, .115 and .164), replication factor 3 and gc_grace_seconds of 10 days (the default). Some nodes are reported as DN by certain nodes but as UN by others. For example, below is the nodetool status output from the .54 and the .115 nodes:

[image: nodetool status output from .54]

and

[image: nodetool status output from .115]

while, for example, on .87 every node is UN. This has been happening for at least a couple of weeks. It started with two nodes, .54 and .147, showing each other as down, but it seems to have spread: more and more nodes now show DN on some nodes (but not on all). I should add that there were no writes during these weeks.

  1. I have tried disabling and re-enabling gossip and restarting Cassandra on all nodes. Generation stamp is up to date in system_auth table. I can connect to these nodes with cqlsh but, as expected, in some cases I get NoHostAvailable because some of the data is located on the "dead" nodes.

  2. nodetool describecluster shows the DN nodes as Unreachable, and which ones it lists depends on the node I run it on. For example, .54 shows .164, .115, .147 and .19 as Unreachable.

  3. Also in nodetool gossipinfo everything looks ok with status: normal and up-to-date generation.

  4. In the debug.log file I only get:

DEBUG [MessagingService-Outgoing-/192.168.100.147-Gossip] 2022-05-30 03:58:43,478 OutboundTcpConnection.java:546 - Unable to connect to /192.168.100.147
java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_312]
    at sun.nio.ch.Net.connect(Net.java:482) ~[na:1.8.0_312]
    at sun.nio.ch.Net.connect(Net.java:474) ~[na:1.8.0_312]
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:647) ~[na:1.8.0_312]
    at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[apache-cassandra-3.11.9.jar:3.11.9]
    at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[apache-cassandra-3.11.9.jar:3.11.9]
    at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:434) [apache-cassandra-3.11.9.jar:3.11.9]
    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:262) [apache-cassandra-3.11.9.jar:3.11.9]
  5. The system.log actually has entries for all the nodes saying "Node...has restarted, now UP" and "Node...state jump to Normal". However, I also noticed this, which may have nothing to do with it:
WARN  [GossipStage:1] 2022-05-30 07:22:06,164 Gossiper.java:1693 - \
  Received an ack from /192.168.100.127, who isn't a seed.
  Ensure your seed list includes a live node. Exiting shadow round

Is there any way to understand what is happening and why? Am I missing something?

Please let me know if you need any more information.


Solution

  • This looks like a classic networking issue to me where the nodes are unable to gossip with each other because there's no connectivity on the internode port (default is 7000). The debug message you posted clearly states the cause:

    java.net.NoRouteToHostException: No route to host
    

    You need to check that there are no firewalls like iptables or firewalld blocking the traffic on port 7000, otherwise the nodes can't talk to each other.
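
As a rough sketch of that firewall check, the helper below filters a rule dump for entries that mention the internode port or that drop or reject traffic. The function name `find_blocking_rules` is made up for illustration, and it assumes rules in the `iptables -S` format and the default port 7000. Note that a catch-all `REJECT --reject-with icmp-host-prohibited` rule is a common source of "No route to host" errors even when routing itself is fine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print iptables rules that could block internode traffic.
# Reads rules in `iptables -S` format on stdin; the port defaults to 7000.
find_blocking_rules() {
    local port="${1:-7000}"
    # Match rules for the given destination port, plus any DROP/REJECT rule.
    grep -E -- "--dport ${port}( |\$)|-j (DROP|REJECT)" || true
}

# Typical use (needs root):
#   sudo iptables -S | find_blocking_rules 7000
#   sudo firewall-cmd --list-all    # on hosts that use firewalld
```

Any rule it prints still has to be read in context, since rule order decides whether an earlier ACCEPT wins.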

    It is simple enough to test it using Linux tools such as telnet or nc. For example, run this command on node .54:

    $ telnet 192.168.100.115 7000
    

    If the connection fails (for example with "Connection refused" or "No route to host"), it means that one of the following is true:

    • there's no network route to the node,
    • the traffic to the default gossip port 7000 is blocked, or
    • gossip is configured on another port (check storage_port in cassandra.yaml)

    But in my experience, the most likely cause is that traffic is blocked by a firewall. Cheers!
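
To repeat the connectivity test for every node that a given host reports as DN, something along these lines might help. It is only a sketch: the function names are invented for illustration, the cassandra.yaml path in the usage comment is an assumption that depends on how Cassandra was installed, and it relies on an `nc` build that supports `-z` and `-w`.

```shell
#!/usr/bin/env bash
# Sketch: probe the internode port of every node this host reports as DN.

# Print the address column of rows whose status/state is DN.
# Expects `nodetool status` output on stdin.
list_down_nodes() {
    awk '$1 == "DN" { print $2 }'
}

# Print the storage_port value from a cassandra.yaml file (7000 if not set).
get_storage_port() {
    local yaml="$1" port
    port="$(awk -F': *' '$1 == "storage_port" { print $2; exit }' "$yaml")"
    echo "${port:-7000}"
}

# Try a TCP connection with nc: -z only tests the port, -w is a timeout in seconds.
probe_node() {
    local host="$1" port="$2"
    if nc -z -w 5 "$host" "$port"; then
        echo "$host:$port reachable"
    else
        echo "$host:$port NOT reachable"
    fi
}

# Typical use on one node (the yaml path is an assumption):
#   port="$(get_storage_port /etc/cassandra/cassandra.yaml)"
#   nodetool status | list_down_nodes | while read -r host; do
#       probe_node "$host" "$port"
#   done
```

Running it on a node that sees everything as UN (such as .87 in the question) and on one that does not should show which host pairs cannot reach each other.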