Tags: docker, docker-compose, docker-swarm, docker-swarm-mode, swarm

Number of replicas in swarm doesn't start in worker node (1/4)


I deployed a Flask API service to a Docker Swarm cluster with 1 master and 3 worker nodes. I deployed it using the following Docker Compose file:

version: '3'

services:
  xgboost-model-api:
    image: xgboost-model-api
    ports:
      - "5000:5000"
    deploy:
      mode: global
    networks:
      - xgboost-net

networks:
  xgboost-net:

I deployed the stack with the following command:

docker stack deploy --compose-file docker-compose.yml xgboost-swarm

However, the task was started only on my master node and not on any worker node.

$ docker service ls
ID            NAME                             MODE        REPLICAS  IMAGE
pgd8cktr4foz  viz                              replicated  1/1       dockersamples/visualizer
twrpr4av4c7f  xgboost-swarm_xgboost-model-api  global      1/4       xgboost-model-api
xxrfn1w7eqw6  dockercloud-server-proxy         global      1/1       dockercloud/server-proxy 

The Dockerfile being used is here. Any thoughts on why this happens would be appreciated.


Solution

  • As stated in this thread (duplicate?):

    If you are using a private registry, it's important to share the login credentials with the worker nodes by using

    docker stack deploy --with-registry-auth
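
A minimal sketch of the full deploy command with registry credentials forwarded, reusing the compose file and stack name from the question. The wrapper name `deploy_with_auth` is made up for illustration, and the sketch assumes `docker login` has already been run on the manager against the registry in question.

```shell
# Hypothetical helper: deploy a stack and forward the manager's registry
# credentials to the worker nodes so they can pull private images.
# Assumes `docker login <registry>` was already run on this manager.
deploy_with_auth() {
    compose_file=$1
    stack_name=$2
    docker stack deploy --with-registry-auth \
        --compose-file "$compose_file" "$stack_name"
}

# Usage, with the names from the question:
#   deploy_with_auth docker-compose.yml xgboost-swarm
```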

    ---- UPDATE

    From your compose file it doesn't look like you are using a private registry. Generally speaking, if containers can't start successfully on the workers, they will end up on the manager. Some possible reasons for this are:

    1. The workers can't access a private registry (fix with --with-registry-auth)
    2. The application requires some change on the host to run (for example, Elasticsearch requires vm.max_map_count=262144)
    3. The health check fails on the other nodes because it is poorly written
    4. Network setting issues prevent the workers from pulling the image

    Try removing your stack and deploying it again. Then run docker service ps --no-trunc {serviceName}; this might show you the tasks that were supposed to run the service on the other nodes and why they failed.

    Check out this SO thread for more troubleshooting tips.
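
The troubleshooting step above can be sketched as a small script: find the services that run fewer tasks than desired, then print the untruncated task history for each. The function names `under_replicated` and `show_failed_tasks` are hypothetical; `.Name` and `.Replicas` are template fields that `docker service ls --format` accepts.

```shell
# Hypothetical helper: read "NAME RUNNING/DESIRED" lines on stdin and print
# the names of services that run fewer tasks than desired.
under_replicated() {
    awk '{ split($2, r, "/"); if (r[1] + 0 < r[2] + 0) print $1 }'
}

# Hypothetical helper: show the full task history, including error messages,
# for every under-replicated service in the swarm.
show_failed_tasks() {
    docker service ls --format '{{.Name}} {{.Replicas}}' |
        under_replicated |
        while read -r service; do
            echo "== $service =="
            docker service ps --no-trunc "$service"
        done
}

# Usage (on a manager node):
#   show_failed_tasks
```

With the output from the question, only xgboost-swarm_xgboost-model-api (1/4) would be selected. The ERROR column of `docker service ps` is where a failed image pull or a rejected task would be reported.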