If one is running an array job on a slurm cluster, how can one restart a failed worker job?
In a Sun Grid Engine queue, one can add #$ -r y
to the job file to indicate the job should be restarted if it fails--what is the Slurm equivalent of this flag?
You can use --requeue
#SBATCH --requeue ### On failure, requeue for another try
--requeue
Specifies that the batch job should eligible to being requeue. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
See more here: https://slurm.schedmd.com/sbatch.html#lbAE