
Autostart `slurmd` service on computes after reboot


I am calling `scontrol reboot <nodename>` to reboot compute nodes in my SLURM cluster.

The reboot usually times out from SLURM's point of view, and the node is set to state "DOWN" (`ResumeTimeout` is set to 300 seconds).

This presumably happens because the `slurmd` service does not start automatically after boot.
By default, the service is "disabled":

[root@c1 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Enabling it with `systemctl enable slurmd` on the node does not survive the next reboot; afterwards the service is "disabled" again.
I assume this is because the change is made on the running node and not in the image the node boots from.

How can I enable the `slurmd` service on the compute nodes so that it starts on boot and `scontrol reboot` works?


Solution

  • I got a reply from Antanas Budriūnas via the OpenHPC mailing list that solved the issue. The service has to be enabled inside the compute node image, not on the running node:

    (run on the master node)
    # chroot /<path>/<to>/<cnode>/<image>
    # systemctl enable slurmd
    # exit
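The same steps can also be run non-interactively, which may be handy in a provisioning script. The following is only a sketch under a few assumptions: the function names and the image path argument are hypothetical, and the check assumes the `slurmd.service` unit declares `WantedBy=multi-user.target`, so that enabling it creates a symlink under `multi-user.target.wants` in the image.

```shell
#!/bin/sh
# Sketch only: function names and the image path argument are hypothetical.

# Enable slurmd inside the compute node image at path $1.
# Inside a chroot, "systemctl enable" only creates the unit symlinks,
# so no running systemd is needed in the image. Must be run as root.
enable_slurmd_in_image() {
    chroot "$1" systemctl enable slurmd
}

# Succeed if the image at path $1 has slurmd enabled.
# Assumption: the unit file declares WantedBy=multi-user.target.
# -L is also true for a symlink whose target does not exist on the host,
# which is expected here because the target path is relative to the image.
slurmd_enabled_in_image() {
    [ -L "$1/etc/systemd/system/multi-user.target.wants/slurmd.service" ]
}

# Example usage (replace the placeholder with the real image path):
#   enable_slurmd_in_image /<path>/<to>/<cnode>/<image>
#   slurmd_enabled_in_image /<path>/<to>/<cnode>/<image> && echo "slurmd enabled"
```

Depending on the provisioning system, the image may have to be rebuilt or re-synced before the nodes pick up the change; with Warewulf, for example, this is typically done with `wwvnfs --chroot`. Whether that applies depends on the cluster setup, which the question does not specify.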