Tags: cassandra, docker-swarm, spring-data-cassandra

Cassandra replicas on single docker-swarm node


I have a single docker-swarm manager node (18.09.6) running and I'm playing with spinning up a Cassandra cluster. I'm using the following definition, and it works in that the seed/master and slave spin up and communicate/replicate their data and schema changes fine:

services:
  cassandra-masters:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-masters
    deploy:
      mode: replicated
      replicas: 1
  cassandra-slaves:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_SEEDS=cassandra-masters
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-slaves
    deploy:
      mode: replicated
      replicas: 1
    depends_on:
      - cassandra-masters

When I change the replica count from 1 to 2, either on deployment of the stack or with a post-deploy scale, the second task for the Cassandra slave is created but repeatedly fails with an error indicating it cannot gossip with the seed node:
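
For reference, here is a minimal sketch of the commands used to deploy and scale the stack, assuming it is named cassandra and the definition above is saved as docker-compose.yml (both names are illustrative):

# Deploy the stack from the definition above (stack and file names are illustrative)
docker stack deploy -c docker-compose.yml cassandra

# Scale the non-seed service to a second task after deployment
docker service scale cassandra_cassandra-slaves=2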

INFO  10:51:03 Loading persisted ring state
INFO  10:51:03 Starting Messaging Service on /10.10.0.200:7000 (eth0)
INFO  10:51:03 Handshaking version with cassandra-masters/10.10.0.142
Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any seeds
java.lang.RuntimeException: Unable to gossip with any seeds
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360)
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657)
ERROR 10:51:34 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360) ~[apache-cassandra-2.2.14.jar:2.2.14]
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521) ~[apache-cassandra-2.2.14.jar:2.2.14]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756) ~[apache-cassandra-2.2.14.jar:2.2.14]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676) ~[apache-cassandra-2.2.14.jar:2.2.14]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562) ~[apache-cassandra-2.2.14.jar:2.2.14]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310) [apache-cassandra-2.2.14.jar:2.2.14]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548) [apache-cassandra-2.2.14.jar:2.2.14]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657) [apache-cassandra-2.2.14.jar:2.2.14]
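
While the second task is crash-looping, one way to see what the seed itself thinks of the ring is to run nodetool inside the seed's container; a sketch, assuming the stack is named cassandra as above (so the name filter is illustrative):

# Find the seed container on this node, then ask it for its view of the cluster
docker ps --filter name=cassandra_cassandra-masters -q
docker exec -it <container-id> nodetool status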

I'd like to understand what is causing the issue and whether there is a way to work around it. I'm just investigating what roadblocks there are to getting to production, where we'd obviously be spinning the Cassandra tasks/replicas up on different nodes rather than a single one.

EDIT: I've spun the same stack up on a two-node swarm and I'm seeing the same behaviour, i.e. when I scale to a second "slave" task, it fails with the same error, so the issue isn't particular to running two tasks on the same node.


Solution

  • I never got to the bottom of why the gossiping fails, but ultimately we agreed on a production deployment strategy that doesn't require auto-scaling: instead, capacity is planned up front based on the system's behaviour and expected traffic (one possible shape of such a fixed-size definition is sketched below). This answer also points out the additional strain that auto-scaling can add to an already stretched system: AWS and auto scaling cassandra
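
For completeness, a minimal sketch of what that fixed-capacity approach could look like in the stack file: rather than scaling one "slave" service, each planned node is declared as its own single-replica service with its own broadcast address. The cassandra-node-1/cassandra-node-2 names and layout are illustrative, not something we actually ran in production:

services:
  cassandra-seed:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-seed
    deploy:
      replicas: 1
  cassandra-node-1:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_SEEDS=cassandra-seed
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-node-1
    deploy:
      replicas: 1
  cassandra-node-2:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_SEEDS=cassandra-seed
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-node-2
    deploy:
      replicas: 1

Because the cluster size is fixed in the definition itself, growing it means adding another service block and redeploying, which fits a capacity-planning approach better than adjusting a replica count.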