Tags: ssh, error-handling, gpu, redhat, slurm

srun: error: Slurm controller not responding, sleeping and retrying


Running the following command in Slurm:

$ srun -J FRD_gpu --partition=gpu --gres=gpu:1 --time=0-02:59:00 --mem=2000 --ntasks=1 --cpus-per-task=1 --pty /bin/bash -i

Returns the following error:

srun: error: Slurm controller not responding, sleeping and retrying.

The Slurm controller seems to be up:

$ scontrol ping
Slurmctld(primary) at narvi-install is UP 

Any idea why and how to resolve this?

$ scontrol -V
slurm 18.08.8

System info: gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)

$ sinfo 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal       up 7-00:00:00      1 drain* me99
normal       up 7-00:00:00      3  down* me[64-65,97]
normal       up 7-00:00:00      1  drain me89
normal       up 7-00:00:00     23    mix me[55,67,86,88,90-94,96,98,100-101],na[27,41-42,44-45,47-49,51-52]
normal       up 7-00:00:00     84  alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158],na[01-26,28-40,43,46,50,53-60]
normal       up 7-00:00:00      3   idle me[82,151-152]
test*        up    4:00:00      1 drain* me99
test*        up    4:00:00      3  down* me[64-65,97]
test*        up    4:00:00      2  drain me[04,89]
test*        up    4:00:00     27    mix me[55,67,86,88,90-94,96,98,100-101,248,260],meg[11-12],na[27,41-42,44-45,47-49,51-52]
test*        up    4:00:00    130  alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158,233-247,249-259,261-280],na[01-26,28-40,43,46,50,53-60]
test*        up    4:00:00     14   idle me[01-03,50-54,82,151-152],meg10,nag[01,14]
grid         up 7-00:00:00     10    mix na[27,41-42,44-45,47-49,51-52]
grid         up 7-00:00:00     42  alloc na[01-26,28-32,43,46,50,53-60]
gpu          up 7-00:00:00     15    mix meg[11-12],nag[02-10,12-13,16-17]
gpu          up 7-00:00:00      4   idle meg10,nag[01,11,15]

Solution

  • If you are positive the Slurm controller is up and running (for instance, the sinfo command responds), SSH to the compute node allocated to your job and run scontrol ping there to test connectivity to the controller. If it fails, look for firewall rules blocking the connection from the compute node to the controller; a diagnostic sketch follows below.
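
A minimal diagnostic sketch for that check. The node name nag02 and the port 6817 (Slurm's default slurmctld port) are placeholders; take the actual node from your squeue output and the host/port from your site's slurm.conf:

$ squeue -u $USER -o "%.10i %.12j %.10T %R"    # find the node the job landed on
$ ssh nag02                                    # replace with the node from squeue
$ scontrol ping                                # from the compute node; should report UP
$ scontrol show config | grep -Ei 'slurmctldhost|slurmctldport'
$ nc -zv narvi-install 6817                    # probe the controller port directly
$ sudo firewall-cmd --list-all                 # on RHEL, check for rules blocking that port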