SLURM: "not a valid controller" error on compute nodes


On my master node slurmctld is working, but on all the compute nodes it fails with this error:

slurmctld[1747]: slurmctld: error: This host (hostname/hostname) not a valid controller

Apart from this, the cluster appears to be working. Do you have any advice on how to understand what is going on and how to fix it?

Thanks

PS. Some new info: the SLURM version installed on the master node is the same as the one installed on the compute nodes. The /etc/hosts file seems fine on all nodes. The hostname shown in the error is the same one the node reports with "hostname".


Solution

  • The Slurm controller (slurmctld service) should not run on the compute nodes, only on the management node(s).

    The compute nodes must only run the slurmd service.
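
As a rough illustration of the point above, the sketch below prints (but does not run) the service commands that would suit a given host. It assumes the nodes use systemd with units named slurmctld and slurmd; the function name slurm_fix_commands and the host names "master" and "node01" are hypothetical placeholders, and the controller name would normally be whatever your slurm.conf designates as the controller.

```shell
#!/bin/sh
# Hypothetical helper: print the systemctl commands appropriate for a host.
# $1 = this host's name (e.g. the output of `hostname`)
# $2 = the controller host name (assumed to match the one set in slurm.conf)
slurm_fix_commands() {
    this_host="$1"
    controller_host="$2"
    if [ "$this_host" = "$controller_host" ]; then
        # Management node: the controller daemon belongs here.
        echo "systemctl enable --now slurmctld"
    else
        # Compute node: stop the controller daemon, run only slurmd.
        echo "systemctl disable --now slurmctld"
        echo "systemctl enable --now slurmd"
    fi
}
```

For example, `slurm_fix_commands "$(hostname)" master` on a compute node prints the two commands to run there as root. Once slurmctld is disabled on the compute nodes, the "not a valid controller" message should no longer appear in their logs, since only the designated controller host starts that daemon.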