I am calling scontrol reboot <nodename> to reboot compute nodes in my SLURM cluster.
The reboot usually times out (as seen from SLURM) and the node is set to state "DOWN" (RESUME_TIMEOUT is set to 300).
This presumably happens because the slurmd service does not start automatically after boot.
By default, the service is "disabled":
[root@c1 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Enabling it with systemctl enable slurmd on the node does not survive the next reboot; the service is "disabled" again afterwards.
I assume this is because the change is not made in the image that is used for booting.
How can I enable the slurmd service on the compute nodes so that it starts on boot and scontrol reboot works?
I got a reply from Antanas Budriūnas via the OpenHPC mailing list, which solved the issue.
(execute on the master node)
# chroot /<path>/<to>/<cnode>/<image>
# systemctl enable slurmd
# exit
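
The three commands above can also be run non-interactively. The following is only a sketch of that idea, not part of the original reply: the function name enable_slurmd_in_image and the DRY_RUN variable are made up for illustration, and the image path stays a placeholder that you have to fill in yourself. It assumes it is run as root on the master node.

```shell
#!/bin/sh
# Sketch: enable slurmd inside a compute-node image from the master node.
# enable_slurmd_in_image and DRY_RUN are hypothetical names, not OpenHPC tools.

enable_slurmd_in_image() {
    image="$1"
    # Refuse to run against a path that is not a directory.
    if [ ! -d "$image" ]; then
        echo "error: image directory not found: $image" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command that would be executed.
        echo "chroot $image systemctl enable slurmd"
        return 0
    fi
    # Same effect as the interactive chroot / systemctl enable / exit sequence.
    chroot "$image" systemctl enable slurmd
}

# Example call (dry run, so nothing is changed):
# DRY_RUN=1 enable_slurmd_in_image "/<path>/<to>/<cnode>/<image>"
```

After the next reboot of a compute node, running systemctl is-enabled slurmd on that node should show whether the change arrived there. Depending on how the image is provisioned, it may have to be rebuilt or redistributed before the nodes pick it up; the reply did not cover that step.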