Slurm: How to restart failed worker job

If one is running an array job on a Slurm cluster, how can one restart a failed worker job?

In a Sun Grid Engine queue, one can add #$ -r y to the job file to indicate that the job should be restarted if it fails. What is the Slurm equivalent of this flag?

Solution

You can use the --requeue option of sbatch, which marks the job as eligible to be requeued:

#SBATCH --requeue                   ### Make the job eligible for requeuing (e.g. after node failure or preemption)

--requeue

Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.

See more here: https://slurm.schedmd.com/sbatch.html#lbAE
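
Note that the conditions listed in the documentation quoted above (administrator request, node failure, preemption) do not include the script itself exiting with a non-zero code. If that is the kind of failure you want to retry, one possible approach is to have the script requeue itself with scontrol requeue. The following is only a minimal sketch under some assumptions: ./my_worker, the array range 0-9, and the MAX_RETRIES cap are hypothetical placeholders, and the helper should_requeue is an illustrative name, not part of Slurm. SLURM_JOB_ID, SLURM_ARRAY_TASK_ID, and SLURM_RESTART_COUNT are environment variables that Slurm sets for the job.

```bash
#!/bin/bash
#SBATCH --requeue          # make the job eligible for requeuing
#SBATCH --array=0-9        # hypothetical array range

MAX_RETRIES=3              # hypothetical cap, to avoid requeuing forever

# Return 0 (true) when the task failed and has restarts left.
#   $1 = exit code of the task
#   $2 = number of times the job has already been restarted
#   $3 = maximum number of restarts allowed
should_requeue() {
    local exit_code=$1 restart_count=$2 max_retries=$3
    [ "$exit_code" -ne 0 ] && [ "$restart_count" -lt "$max_retries" ]
}

# Hypothetical payload: replace with the real work for one array task.
run_task() {
    ./my_worker "${SLURM_ARRAY_TASK_ID}"
}

# Only act when running under Slurm, so the helper can be tried elsewhere.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_task
    status=$?
    # SLURM_RESTART_COUNT is unset on the first run, so default it to 0.
    if should_requeue "$status" "${SLURM_RESTART_COUNT:-0}" "$MAX_RETRIES"; then
        # Ask Slurm to put this array task back in the queue.
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit "$status"
fi
```

Inside an array job, SLURM_JOB_ID refers to the individual array task, so only the failed task is requeued, not the whole array. As with any requeue, the batch script starts again from its beginning, so the payload should be safe to rerun.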