Running the following command on a Slurm cluster:
$ srun -J FRD_gpu --partition=gpu --gres=gpu:1 --time=0-02:59:00 --mem=2000 --ntasks=1 --cpus-per-task=1 --pty /bin/bash -i
Returns the following error:
srun: error: Slurm controller not responding, sleeping and retrying.
The Slurm controller seems to be up:
$ scontrol ping
Slurmctld(primary) at narvi-install is UP
Any idea why this happens and how to resolve it?
$ scontrol -V
slurm 18.08.8
System compiler: gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
$ sinfo
PARTITION  AVAIL   TIMELIMIT  NODES  STATE   NODELIST
normal     up     7-00:00:00      1  drain*  me99
normal     up     7-00:00:00      3  down*   me[64-65,97]
normal     up     7-00:00:00      1  drain   me89
normal     up     7-00:00:00     23  mix     me[55,67,86,88,90-94,96,98,100-101],na[27,41-42,44-45,47-49,51-52]
normal     up     7-00:00:00     84  alloc   me[56-63,66,68-74,76-81,83-85,87,95,102,153-158],na[01-26,28-40,43,46,50,53-60]
normal     up     7-00:00:00      3  idle    me[82,151-152]
test*      up        4:00:00      1  drain*  me99
test*      up        4:00:00      3  down*   me[64-65,97]
test*      up        4:00:00      2  drain   me[04,89]
test*      up        4:00:00     27  mix     me[55,67,86,88,90-94,96,98,100-101,248,260],meg[11-12],na[27,41-42,44-45,47-49,51-52]
test*      up        4:00:00    130  alloc   me[56-63,66,68-74,76-81,83-85,87,95,102,153-158,233-247,249-259,261-280],na[01-26,28-40,43,46,50,53-60]
test*      up        4:00:00     14  idle    me[01-03,50-54,82,151-152],meg10,nag[01,14]
grid       up     7-00:00:00     10  mix     na[27,41-42,44-45,47-49,51-52]
grid       up     7-00:00:00     42  alloc   na[01-26,28-32,43,46,50,53-60]
gpu        up     7-00:00:00     15  mix     meg[11-12],nag[02-10,12-13,16-17]
gpu        up     7-00:00:00      4  idle    meg10,nag[01,11,15]
If you are positive the Slurm controller is up and running (for instance, the sinfo
command responds), SSH to the compute node that is allocated to your job and run scontrol ping
there to test connectivity to the controller, as sketched below.
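For example, assuming the job did receive an allocation (the node name nag02 below is hypothetical; use whatever squeue reports for your job):

$ squeue -u $USER -n FRD_gpu -o "%N"   # show the node(s) allocated to the FRD_gpu job
$ ssh nag02                            # hypothetical node name taken from the squeue output
$ scontrol ping                        # should report Slurmctld(primary) at narvi-install is UP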
If it fails, look for firewall rules blocking the connection from the compute node to the controller; a quick port check is sketched below.
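A minimal sketch of such a check, run on the compute node. It assumes nc (netcat) is installed and that slurmctld listens on its default port 6817; the actual controller host and port are whatever slurm.conf defines, which scontrol show config will print:

$ scontrol show config | grep -Ei 'slurmctldhost|slurmctldport'   # confirm the controller host and port
$ nc -zv narvi-install 6817   # test whether the slurmctld port is reachable (6817 is the Slurm default)

If the port is blocked, the firewall on the controller, on the compute node, or on the network in between needs to allow Slurm's ports (SlurmctldPort and SlurmdPort) in both directions.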