When a node fails to `POWER_UP`, it is marked `DOWN`. While this is a great idea in general, it is not useful when working with `CLOUD` nodes, because such a `CLOUD` node is likely to be started on a different machine next time and would therefore `POWER_UP` without issues. But since the node is marked as down, that cloud resource is no longer used and is never started again until the node is freed manually.
Ideally Slurm would not mark the node as `DOWN`, but simply attempt to start another one. If that's not possible, automatically resuming `DOWN` nodes would also be an option.
How can I prevent Slurm from marking nodes that fail to `POWER_UP` as `DOWN`, or make Slurm restore `DOWN` nodes automatically, so that Slurm does not forget about its cloud resources?
I tried solving this using `ReturnToService`, but that didn't seem to solve my issue: if I understand it correctly, it only accepts nodes that start up by themselves or are started manually, and does not take them into consideration when scheduling jobs until they have been started.
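For reference, this is the setting I tried (a sketch; the comment reflects my reading of the documentation):

```
# slurm.conf (excerpt)
# 2 = a DOWN node becomes available again as soon as it registers with a
#     valid configuration; registration requires a running slurmd, which a
#     deleted cloud instance never provides.
ReturnToService=2
```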
I considered using `ResumeFailProgram`, but it sounds odd that you have to write a script yourself just to return your nodes to service.
While this is great and definitely helpful, it doesn't solve the issue at hand, since a node that failed during power-up is not suspended.
In the `POWER_UP` script I terminate the server if the setup fails for any reason and return a non-zero exit code.
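Roughly, the script does something like this (a simplified sketch; `create_instance`, `wait_for_node` and `delete_instance` stand in for the real cloud-provider calls):

```
#!/bin/bash
# ResumeProgram ("power up" script): slurmctld passes the hostlist of nodes
# to bring up as $1. The helper functions are placeholders for the real
# provisioning calls.
for node in $(scontrol show hostnames "$1"); do
    if ! create_instance "$node" || ! wait_for_node "$node"; then
        # Setup failed: tear the instance down again and exit non-zero,
        # which is what makes slurmctld mark the node DOWN.
        delete_instance "$node" || true
        exit 1
    fi
done
exit 0
```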
In our cloud scheduling, instances are created once they are needed and deleted once they are no longer needed. This means that Slurm records a node as `DOWN` while no real instance exists behind it anymore. If that node were not marked `DOWN` and a job were scheduled to it at a later time, it would simply start a new instance and run on that. I am just stating this to be maximally explicit.
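For completeness, the power-down side is essentially just the deletion, again sketched with `delete_instance` as a placeholder for the real API call:

```
#!/bin/bash
# SuspendProgram ("power down" script): deletes the backing instances once
# Slurm no longer needs the nodes listed in $1.
for node in $(scontrol show hostnames "$1"); do
    delete_instance "$node"
done
```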
`ResumeFailProgram` is exactly the way to solve this. You're hitting the mismatch between Slurm's view of "nodes" and what actually provides them in a cloud environment. In this situation the `ResumeFailProgram` should probably use the cloud APIs to check whether the failure was likely a transient provisioning one (some errors, such as out-of-quota or lost networking, might need intervention rather than retrying, and staying `DOWN` is probably the right thing to do there), and if so, just run `scontrol update nodename=whatever state=resume` to make that "node" (really, just a nodename here) eligible for power saving to try and "resume" it again.
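As a rough sketch, not a drop-in solution (`is_transient_failure` is a placeholder for whatever cloud API check makes sense in your environment, and the script path below is just an example):

```
#!/bin/bash
# ResumeFailProgram: slurmctld passes the hostlist of nodes whose power-up
# failed or timed out as $1. is_transient_failure is a placeholder for a
# check against the cloud API (e.g. do NOT auto-resume on out-of-quota or
# broken networking, where staying DOWN and alerting a human is better).
for node in $(scontrol show hostnames "$1"); do
    if is_transient_failure "$node"; then
        # Clear the DOWN state so power saving can try this nodename again
        # the next time a job is scheduled to it.
        scontrol update nodename="$node" state=resume
    fi
done
```

Hook it in with something like `ResumeFailProgram=/opt/slurm/bin/resume_fail.sh` in slurm.conf.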