I have a single Docker Swarm manager node (Docker 18.09.6) running and I'm experimenting with spinning up a Cassandra cluster. I'm using the following stack definition, and it works: the seed ("master") and the "slave" both start, communicate with each other, and replicate data and schema changes without problems:
services:
  cassandra-masters:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-masters
    deploy:
      mode: replicated
      replicas: 1
  cassandra-slaves:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_SEEDS=cassandra-masters
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-slaves
    deploy:
      mode: replicated
      replicas: 1
    depends_on:
      - cassandra-masters
When I change the replica count of cassandra-slaves from 1 to 2, either when deploying the stack or by scaling after deployment, the second cassandra-slaves task is created but fails repeatedly with an error indicating that it cannot gossip with the seed node:
INFO 10:51:03 Loading persisted ring state
INFO 10:51:03 Starting Messaging Service on /10.10.0.200:7000 (eth0)
INFO 10:51:03 Handshaking version with cassandra-masters/10.10.0.142
Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any seeds
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657)
ERROR 10:51:34 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310) [apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548) [apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657) [apache-cassandra-2.2.14.jar:2.2.14]
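For reference, the deploy-time variant of the change that triggers this is sketched below as a fragment of the definition above; only the replica count differs. The post-deploy variant would be something like `docker service scale STACK_cassandra-slaves=2`, where `STACK` is a placeholder for whatever name the stack was deployed under (the actual name is not shown in this question).

```yaml
services:
  cassandra-slaves:
    image: cassandra:2.2
    # environment and depends_on unchanged from the definition above
    deploy:
      mode: replicated
      replicas: 2   # was 1; the second task is the one that fails to gossip
```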
I'd like to understand what is causing the issue and whether there is a way to work around it. I'm investigating what roadblocks we might hit on the way to production, where we'd obviously be running the Cassandra tasks/replicas on different nodes rather than on a single node.
EDIT: I've spun the same stack up on a two-node swarm and I'm seeing the same behaviour: when I scale to a second "slave" task, it fails with the same error, so the issue is not specific to running two tasks on the same node.
I've not gotten to the bottom of why the gossiping fails, but ultimately we agreed on a production deployment strategy that does not require auto-scaling; instead, we plan capacity based on the system's behaviour and expected traffic. This answer also points out the additional strain that auto-scaling can put on an already stretched system: AWS and auto scaling cassandra