docker, docker-compose, docker-swarm, slurm, docker-stack

How do I fix this intermittent job completion failure in dockerized slurm?


I'm trying to build a fully dockerized deployment of slurm using docker stacks, but jobs don't complete consistently. Does anyone have any idea why this might be?

Other than this problem, the system works: All the nodes come up, I can submit jobs, and they run. The problem I am having is that some jobs don't complete properly. Right now it's running on a single-node swarm.

I can submit a bunch of them with:

salloc -t 1 srun sleep 10

and I can watch them with squeue. Some of them complete after 10 seconds as expected, but most keep running until they hit the 1-minute timeout from -t 1.
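
For example, something along these lines (the loop count is arbitrary) fires off a batch of jobs and then polls the queue:

# submit ten 1-minute allocations, each running a 10-second sleep
for i in $(seq 1 10); do
    salloc -t 1 srun sleep 10 &
done

# poll the queue to see which jobs finish after ~10s and which hang until the timeout
watch -n 2 squeue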

The system consists of five docker services:

  • slurm-stack_mysql
  • slurm-stack_slurmdbd
  • slurm-stack_slurmctld
  • slurm-stack_c1
  • slurm-stack_c2

c1 and c2 are the worker nodes. All five services run the same docker image (Dockerfile below) and are configured with the docker-compose.yml linked below.

Here are some things I've noticed and tried:

  1. I based the Dockerfile and docker-compose.yml on a docker-compose-based version (i.e., without stacks or swarm). That version works just fine -- jobs complete as usual. So it seems like it's something in the transition to Docker Stacks that's causing trouble. The original is here: https://github.com/giovtorres/slurm-docker-cluster

  2. I noticed in the logs that slurmdbd was getting "Error connecting slurm stream socket at 10.0.2.6:6817: Connection refused" errors; the address it was failing to reach corresponded to the swarm load balancer. I managed to get rid of these by declaring all the services as global deployments in docker-compose.yml. Other than eliminating the connection failures, that didn't seem to change anything. EDIT: @chris-becke pointed out that I was misusing global, so I've turned it off. No help, and the "connection refused" errors returned.

  3. When I run host c1, host c2, or host <service> for any of the services in my system from inside one of the containers, I always get back two IP addresses. One of them corresponds to what I see in the Containers section of docker network inspect slurm-stack_default; the other is one lower (e.g., 10.0.38.12 and 10.0.38.11). If I run ip addr in one of the containers, the address it reports matches what's listed for that host in the output of docker network inspect. (The exact commands I'm comparing are sketched right after this list.)
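
For reference, these are the commands I'm comparing (run from inside one of the containers; names are the ones from my stack):

# the service name resolves to two addresses
host c1
host c2

# tasks.<service> is swarm's DNS entry for the task containers only
host tasks.c1

# the container's own address, to compare against docker network inspect slurm-stack_default
ip addr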

Configuration Files

Here are all the configuration files for the system:

I start it with docker stack deploy -c docker-compose.yml slurm-stack.
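
For completeness, the commands I use to bring the stack up and check on it are roughly:

# deploy the stack from the compose file
docker stack deploy -c docker-compose.yml slurm-stack

# confirm all five services have running tasks
docker stack services slurm-stack
docker service ps slurm-stack_slurmctld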

Example Logs

These are representative logs from a run where jobs are not finishing consistently. In this case, jobs 2 (running on c2) and 3 (running on c1) don't complete correctly, but job 1 (running on c1) does.
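
For reference, this is roughly how I pull the scheduler state and daemon logs during a run like this (job IDs are the ones from this example):

# scheduler's view of the stuck jobs
squeue
scontrol show job 2
scontrol show job 3

# daemon logs from the controller and both compute services
docker service logs slurm-stack_slurmctld
docker service logs slurm-stack_c1
docker service logs slurm-stack_c2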

Software Version Info

Slurm version info:

$ sinfo -V
slurm-wlm 21.08.5

Docker version information:

$ docker version
Client:
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.17.3
 Git commit:        20.10.12-0ubuntu4
 Built:             Mon Mar  7 17:10:06 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.22
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.9
  Git commit:       42c8b31
  Built:            Thu Dec 15 22:25:49 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.14
  GitCommit:        9ba4b250366a5ddde94bb7c9d1def331423aa323
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Linux version:

$ uname -a
Linux slurmctld 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Edit: For good measure, I rebuilt everything on a brand new cloud instance with the latest docker (24.0.5) and kernel (5.15.0-78). The results are the same.


Solution

  • Docker creates a VIP, or virtual IP, associated with each service. When multiple tasks exist, this VIP load-balances between the healthy tasks. It also ensures that consumers are not affected by IP changes when tasks restart.

    Each task container gets its own IP. Normally consumers are insulated from this duality: the service name is associated with the VIP, and tasks.<service> is the dnsrr entry associated with the 0, 1, or more IPs of the individual task containers.

    However, Docker also registers the container hostname in its internal DNS, and here a frequent antipattern that refuses to die steps in: lots of compose files, for no reason at all, declare a hostname identical to the service name.

    This, as you have found, can have weird unintended side effects: now the hostname AND the service name both resolve, so a lookup returns both the VIP and the task IP, when you really just want one response.

    In dnsrr mode, and in plain compose deployments, Docker does not create VIPs but simply registers each service task's IP. There the service name and the hostname get registered with the same IP, so c1 and c2 always resolve to a single address.

    Nonetheless, the fix should be to remove the hostname: entries from the services and switch back to VIP mode, as it will probably confuse the broker if a worker starts work on one IP, gets restarted for some reason, and finishes on a different IP.
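
    As a rough sketch of what that change looks like in the compose file (service names taken from the question; the image name and everything else here is a placeholder rather than your actual config):

    services:
      c1:
        image: slurm-docker-cluster:latest   # placeholder image name
        # hostname: c1   <- remove; it shadows the service name in Docker's DNS
        deploy:
          endpoint_mode: vip   # the default; keeps a single stable VIP in front of the task
      c2:
        image: slurm-docker-cluster:latest   # placeholder image name
        # hostname: c2   <- remove as well
        deploy:
          endpoint_mode: vip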