Tags: docker, docker-compose, docker-swarm, docker-machine

How to get my machine back to swarm manager status?


I have two AWS instances:

production-01
docker-machine-master

I SSH into docker-machine-master and run docker stack deploy -c deploy/docker-compose.yml --with-registry-auth production, and I get this error:

this node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again

My guess is the swarm manager went down at some point and this new instance somehow spun up, keeping the same information/configuration minus the swarm manager info. Maybe the internal IP changed or something. I'm making that guess because the launch times differ by months; the production-01 instance was launched six months earlier. I wouldn't know for sure because I am new to AWS, Docker, and this project.

I want to deploy code changes to the production-01 instance, but I don't have SSH keys to do so. Also, my hunch is that production-01 is a replica noted in the docker-compose.yml file.

I'm the only dev on this project so any help would be much appreciated.

Here's a copy of my docker-compose.yml file with names changed.

version: '3'
services:
  database:
    image: postgres:10
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
    deploy:
      replicas: 1
    volumes:
      - db:/var/lib/postgresql/data
  aservicename:
    image: 123.456.abc.amazonaws.com/reponame
    ports:
      - 80:80
    depends_on:
      - database
    environment:
      DB_HOST: database
      DATA_IMPORT_BUCKET: some_sql_bucket
      FQDN: somedomain.com
      DJANGO_SETTINGS_MODULE: name.settings.production
      DEBUG: "true"
    deploy:
      mode: global
    logging:
      driver: awslogs
      options:
        awslogs-group: aservicename
  cron:
    image: 123.456.abc.amazonaws.com/reponame
    depends_on:
      - database
    environment:
      DB_HOST: database
      DATA_IMPORT_BUCKET: some_sql_bucket
      FQDN: somedomain.com
      DOCKER_SETTINGS_MODULE: name.settings.production
    deploy:
      replicas: 1
    command: /name/deploy/someshellfile.sh
    logging:
      driver: awslogs
      options:
        awslogs-group: cron

networks:
  default:
    driver: overlay
    ipam:
      driver: default
      config:
        - subnet: 192.168.100.0/24

volumes:
  db:
    driver: rexray/ebs


Solution

  • I'll assume you only have the one manager, and production-01 is a worker.

    If docker info shows Swarm: inactive and you don't have backups of the Swarm raft log, then you'll need to create a new swarm with docker swarm init.
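
    As a sketch, assuming you want to advertise the node's private IP (the address below is a placeholder):

        docker info --format '{{.Swarm.LocalNodeState}}'          # "inactive" means this node is not in a swarm
        docker swarm init --advertise-addr <manager-private-ip>   # create a new swarm with this node as manager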

    Be sure the node has the rexray/ebs plugin by checking docker plugin ls. All nodes will need that plugin to use the db volume.
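
    A minimal check-and-install sketch, assuming you authenticate with access keys (REX-Ray can also use the instance's IAM role, in which case the keys can be omitted; the key values are placeholders):

        docker plugin ls                      # rexray/ebs should appear with ENABLED true
        docker plugin install rexray/ebs \
            EBS_ACCESSKEY=<aws-access-key> \
            EBS_SECRETKEY=<aws-secret-key>    # install the plugin if it's missing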

    If you can't SSH to production-01, there will be no way to have it leave and join the new swarm. You'd need to deploy a new worker node and shut down that existing server.
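
    For example, once the new swarm exists, print the worker join token on the manager and run the resulting command on the replacement worker (token and IP are placeholders):

        # on the manager
        docker swarm join-token worker

        # on the new worker node
        docker swarm join --token <worker-token> <manager-private-ip>:2377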

    Then you can docker stack deploy that app again and it should reconnect the db volume.
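
    That's the same command from the question, run on the new manager:

        docker stack deploy -c deploy/docker-compose.yml --with-registry-auth production
        docker service ls    # verify the services come up with the expected replica counts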

    Note 1: Don't redeploy the stack on new servers while it's still running on the production-01 worker; it would fail because the EBS volume for db will still be attached to production-01.

    Note 2: For anything beyond learning, it's best to run three managers (managers are also workers by default). That way, if one node is lost, you still have a working swarm.
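
    A sketch of growing to three managers once you have the extra nodes (token and IP are placeholders):

        # on the existing manager
        docker swarm join-token manager

        # on each of the two new nodes
        docker swarm join --token <manager-token> <manager-private-ip>:2377

        # back on any manager: MANAGER STATUS should show Leader/Reachable
        docker node ls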