docker, docker-compose, docker-swarm, slurm, docker-stack

How do I fix this intermittent job completion failure in dockerized slurm?


I'm trying to build a fully dockerized deployment of slurm using docker stacks, but jobs don't complete consistently. Does anyone have any idea why this might be?

Other than this problem, the system works: All the nodes come up, I can submit jobs, and they run. The problem I am having is that some jobs don't complete properly. Right now it's running on a single-node swarm.

I can submit a bunch of them with:

salloc -t 1 srun sleep 10

and I can watch them with squeue. Some of them complete after 10 seconds as expected, but most keep running until they hit the 1-minute timeout from -t 1.
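
For example, something along these lines (the loop count is arbitrary) fires off a batch of jobs and then polls the queue:

# submit ten 1-minute allocations, each running a 10-second sleep
for i in $(seq 1 10); do
    salloc -t 1 srun sleep 10 &
done

# poll the queue to see which jobs finish after ~10s and which hang until the timeout
watch -n 2 squeue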

The system consists of five docker services:

  • slurm-stack_mysql
  • slurm-stack_slurmdbd
  • slurm-stack_slurmctld
  • slurm-stack_c1
  • slurm-stack_c2

c1 and c2 are the worker nodes. All five services run the same docker image (Dockerfile below) and are configured with the docker-compose.yml linked below.

Here are some things I've noticed and tried:

  1. I based the Dockerfile and docker-compose.yml on a docker-compose-based version (i.e., without stacks or swarm). That version works just fine -- jobs complete as usual. So it seems like it's something in the transition to Docker Stacks that's causing trouble. The original is here: https://github.com/giovtorres/slurm-docker-cluster

  2. I noticed in the logs that slurmdbd was getting "Error connecting slurm stream socket at 10.0.2.6:6817: Connection refused" errors; the address it was failing to reach corresponded to the swarm load balancer. I managed to get rid of these by declaring all the services as global deployments in docker-compose.yml. Other than eliminating the connection failures, that didn't seem to change anything. EDIT: @chris-becke pointed out that I was misusing global, so I've turned it off. No help, and the "connection refused" errors returned.

  3. When I run host c1, host c2, or host <service> for any of the services in my system from inside one of the containers, I always get back two IP addresses. One of them corresponds to what I see in the Containers section of docker network inspect slurm-stack_default; the other is one lower (e.g., 10.0.38.12 and 10.0.38.11). If I run ip addr in one of the containers, the address it reports matches what's listed for that host in the output of docker network inspect. (The exact commands I'm comparing are sketched right after this list.)
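
For reference, these are the commands I'm comparing (run from inside one of the containers; names are the ones from my stack):

# the service name resolves to two addresses
host c1
host c2

# tasks.<service> is swarm's DNS entry for the task containers only
host tasks.c1

# the container's own address, to compare against docker network inspect slurm-stack_default
ip addr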

Configuration Files

Here are all the configuration files for the system:

I start it with docker stack deploy -c docker-compose.yml slurm-stack.
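
For completeness, the commands I use to bring the stack up and check on it are roughly:

# deploy the stack from the compose file
docker stack deploy -c docker-compose.yml slurm-stack

# confirm all five services have running tasks
docker stack services slurm-stack
docker service ps slurm-stack_slurmctld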

Example Logs

These are representative logs from a run where jobs are not finishing consistently. In this case, jobs 2 (running on c2) and 3 (running on c1) don't complete correctly, but job 1 (running on c1) does.
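
For reference, this is roughly how I pull the scheduler state and daemon logs during a run like this (job IDs are the ones from this example):

# scheduler's view of the stuck jobs
squeue
scontrol show job 2
scontrol show job 3

# daemon logs from the controller and both compute services
docker service logs slurm-stack_slurmctld
docker service logs slurm-stack_c1
docker service logs slurm-stack_c2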

Software Version Info

Slurm version info:

$ sinfo -V
slurm-wlm 21.08.5

Docker version information:

$ docker version
Client:
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.17.3
 Git commit:        20.10.12-0ubuntu4
 Built:             Mon Mar  7 17:10:06 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.22
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.9
  Git commit:       42c8b31
  Built:            Thu Dec 15 22:25:49 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.14
  GitCommit:        9ba4b250366a5ddde94bb7c9d1def331423aa323
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Linux version:

$ uname -a
Linux slurmctld 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Edit: For good measure, I rebuilt everything on a brand new cloud instance with the latest docker (24.0.5) and kernel (5.15.0-78). The results are the same.


Solution

  • Docker creates a VIP, or virtual IP, associated with each service. When multiple tasks exist, this VIP load-balances between the healthy tasks. It also ensures that consumers are not affected by IP changes when tasks restart.

    Each task container gets its own IP. Normally consumers are insulated from this duality: the service name is associated with the VIP, and tasks.<service> is the dnsrr entry associated with the 0, 1, or more IPs of the individual task containers.

    However, Docker also registers the container hostname in its internal DNS, and here a frequent antipattern that refuses to die steps in: lots of compose files, for no reason at all, declare a hostname identical to the service name.

    This, as you have found, can have weird unintended side effects: now the hostname AND the service name both resolve, so a lookup returns both the VIP and the task IP, when you really just want one response.

    In dnsrr mode, and in plain compose deployments, Docker does not create VIPs but simply registers each service task's IP. There the service name and the hostname get registered with the same IP, so c1 and c2 always resolve to a single address.

    Nonetheless, the fix should be to remove the hostname: entries from the services and switch back to VIP mode, as it will probably confuse the broker if a worker starts work on one IP, gets restarted for some reason, and finishes on a different IP.
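
    As a rough sketch of what that change looks like in the compose file (service names taken from the question; the image name and everything else here is a placeholder rather than your actual config):

    services:
      c1:
        image: slurm-docker-cluster:latest   # placeholder image name
        # hostname: c1   <- remove; it shadows the service name in Docker's DNS
        deploy:
          endpoint_mode: vip   # the default; keeps a single stable VIP in front of the task
      c2:
        image: slurm-docker-cluster:latest   # placeholder image name
        # hostname: c2   <- remove as well
        deploy:
          endpoint_mode: vip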