
Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system


Issue

Current behavior and problem description

When a node fails to POWER_UP, it is marked DOWN. That is reasonable in general, but it is not useful for CLOUD nodes: a retried POWER_UP would likely land on a different machine and succeed. Because the node is marked DOWN, however, that cloud resource is never used or started again until it is freed manually.

Wanted behavior

Ideally Slurm would not mark the node as DOWN, but simply attempt to start another one. If that is not possible, automatically resuming DOWN nodes would also be an option.

Question

How can I prevent Slurm from marking nodes that fail to POWER_UP as DOWN, or make Slurm restore DOWN nodes automatically, so that cloud resources are not forgotten?

Attempts and Thoughts

ReturnToService

I tried solving this with ReturnToService, but that did not seem to solve my issue: if I understand it correctly, it only applies to nodes that come back up on their own or are brought back manually, and until then they are not considered when scheduling jobs.

ResumeFailProgram

I considered using ResumeFailProgram, but it seems odd to have to write your own script just to return your nodes to service.
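For context, this is how the relevant hooks would sit together in slurm.conf for a cloud setup; the script paths and the timeout value here are hypothetical placeholders, not something from my actual configuration:

```
# slurm.conf excerpt (hypothetical paths)
ResumeProgram=/opt/slurm/cloud/power_up.sh
SuspendProgram=/opt/slurm/cloud/power_down.sh
ResumeFailProgram=/opt/slurm/cloud/resume_fail.sh
ResumeTimeout=600
```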

SlurmctldParameters=idle_on_node_suspend

While this is great and definitely helpful, it does not solve the issue at hand: a node that failed during POWER_UP is not suspended, so this parameter never applies to it.

Additional Information

In the POWER_UP script I terminate the server if the setup fails for any reason, and return a non-zero exit code.
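The failure path in that POWER_UP script looks roughly like the sketch below. `provision_instance` and `terminate_instance` are hypothetical placeholders for the actual cloud-provider API calls, not real commands:

```shell
#!/usr/bin/env bash
# Sketch of the failure path in a cloud POWER_UP (ResumeProgram) script.
# provision_instance / terminate_instance are hypothetical placeholders
# for real cloud-provider API calls.

provision_instance() {
  # placeholder: create and configure the instance for this nodename
  echo "provisioning $1"
}

terminate_instance() {
  # placeholder: delete the instance so no orphan keeps running
  echo "terminating $1"
}

power_up_node() {
  node="$1"
  if ! provision_instance "$node"; then
    # Setup failed: clean up the instance, then report the failure
    # to slurmctld with a non-zero exit code.
    terminate_instance "$node"
    return 1
  fi
}
```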

In our cloud scheduling setup, instances are created when they are needed and deleted when they are no longer needed. This means Slurm records a node as DOWN even though no real instance exists behind it anymore. If that node were not marked DOWN and a job were scheduled to it later, it would simply start a new instance and run there. I am stating this only to be maximally explicit.


Solution

  • ResumeFailProgram is exactly the way to solve this. You are hitting the mismatch between Slurm's view of "nodes" and whatever actually provides them in a cloud environment. The ResumeFailProgram should probably use the cloud APIs to check whether the failure was likely a transient provisioning problem. Some errors, such as out-of-quota or lost networking, need intervention rather than a retry, and in those cases staying DOWN is probably the right thing. If the failure does look transient, run scontrol update nodename=whatever state=resume to make that "node" (really just a nodename at this point) eligible for power saving to attempt the POWER_UP again.
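A minimal sketch of such a ResumeFailProgram, assuming slurmctld passes the failed nodelist as the first argument. The transient-vs-permanent check is a stand-in for a real cloud-API query, and the `SCONTROL` variable only exists here so the logic can be exercised without a live cluster:

```shell
#!/usr/bin/env bash
# Sketch of a ResumeFailProgram handler. slurmctld invokes it with the
# nodelist that failed to come up within ResumeTimeout.

# Stand-in for a real cloud-API query: decide whether the POWER_UP
# failure looks transient (worth retrying) or needs intervention.
failure_is_transient() {
  # e.g. out-of-quota or auth failures should return 1 here
  [ "${CLOUD_FAILURE_KIND:-transient}" = "transient" ]
}

handle_resume_failure() {
  nodelist="$1"
  if failure_is_transient; then
    # Clear DOWN so power saving can attempt the POWER_UP again,
    # most likely on a fresh instance.
    "${SCONTROL:-scontrol}" update nodename="$nodelist" state=resume
  else
    echo "leaving $nodelist DOWN for manual intervention" >&2
    return 1
  fi
}

# When installed as ResumeFailProgram, run the handler on the nodelist:
# handle_resume_failure "$1"
```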