On my master node slurmctld is working, while on all the compute nodes it fails with this error:
slurmctld[1747]: slurmctld: error: This host (hostname/hostname) not a valid controller
The cluster itself appears to be working. Do you have any advice on how to understand what is going wrong and fix it?
Thanks
PS. Some new info: the Slurm version installed on the master node is the same as the one installed on the compute nodes. The /etc/hosts file looks correct on all nodes. The hostname shown in the error is the same as the one printed by the hostname command on that node.
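The Slurm controller (the slurmctld service) should not run on the compute nodes, only on the management node(s). The compute nodes must only run the slurmd service. The error just means that the host is not listed as a controller in slurm.conf (SlurmctldHost, or ControlMachine/BackupController in older releases), which is expected on a compute node and also explains why the cluster still works.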
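
If you want to check which hosts are supposed to run slurmctld, here is a minimal Python sketch that reads the controller entries out of the text of a slurm.conf. The function names (controller_hosts, should_run_slurmctld) are made up for this illustration and are not part of Slurm, and the parsing is deliberately simplified: it assumes each controller entry sits on its own line and it compares short hostnames only.

```python
# Parameter names in slurm.conf that declare a controller host.
# SlurmctldHost is the current name; ControlMachine and BackupController
# are the older equivalents. Parameter names are case insensitive.
CONTROLLER_KEYS = ("slurmctldhost", "controlmachine", "backupcontroller")


def short_name(host):
    """Strip any domain part so 'master.example.org' becomes 'master'."""
    return host.strip().split(".", 1)[0]


def controller_hosts(conf_text):
    """Return the set of short hostnames listed as controllers.

    Hypothetical helper, simplified: one 'Key=Value' per controller line.
    """
    hosts = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() in CONTROLLER_KEYS:
            # SlurmctldHost may be written as name(address); keep the name.
            name = value.split("(", 1)[0].strip()
            if name:
                hosts.add(short_name(name))
    return hosts


def should_run_slurmctld(hostname, conf_text):
    """True only if this host is declared as a controller in the config."""
    return short_name(hostname) in controller_hosts(conf_text)
```

You would call should_run_slurmctld(socket.gethostname(), text) with the contents of your slurm.conf (its location depends on how Slurm was installed). On a compute node it should return False. In that case, assuming the nodes use systemd, stop and disable the controller there with systemctl disable --now slurmctld and make sure slurmd is enabled with systemctl enable --now slurmd.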