Tags: hpc, slurm, sungridengine

Slurm: How to restart failed worker job


When running an array job on a Slurm cluster, how can one restart a failed worker job?

In a Sun Grid Engine queue, one can add #$ -r y to the job file to indicate that the job should be restarted if it fails. What is the Slurm equivalent of this flag?


Solution

  • You can use --requeue

    #SBATCH --requeue                   ### On failure, requeue for another try
    

    --requeue

    Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.

    See more here: https://slurm.schedmd.com/sbatch.html#lbAE
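
The description quoted above lists node failure, preemption, and an explicit administrator requeue as the triggers. It does not mention a task whose own command exits with an error. If that case should be retried too, one possible approach is for the batch script to requeue itself with scontrol requeue. The following is a minimal sketch, not a definitive recipe: ./my_worker, the array range, MAX_RESTARTS, and the helper function names are all hypothetical, and it assumes the cluster allows users to requeue their own jobs. SLURM_RESTART_COUNT and SLURM_JOB_ID are environment variables set by Slurm; in an array job each task has its own SLURM_JOB_ID, so only the failed task is requeued.

```shell
#!/bin/bash
#SBATCH --requeue            # keep the job eligible for requeuing
#SBATCH --array=1-10         # hypothetical array range

MAX_RESTARTS=3               # hypothetical cap, to avoid an endless requeue loop

# Returns 0 while the restart count is still below the cap.
# Slurm sets SLURM_RESTART_COUNT once a job has been restarted or requeued;
# on the first run it is unset, so callers default it to 0.
should_requeue() {
    local restarts="${1:-0}"
    local limit="${2:-$MAX_RESTARTS}"
    [ "$restarts" -lt "$limit" ]
}

# Placeholder for the real work; ./my_worker is a hypothetical program.
run_worker() {
    ./my_worker "${SLURM_ARRAY_TASK_ID:-0}"
}

main() {
    local status=0
    run_worker || status=$?
    if [ "$status" -ne 0 ] && should_requeue "${SLURM_RESTART_COUNT:-0}"; then
        echo "Task failed with status $status; requeuing job $SLURM_JOB_ID" >&2
        scontrol requeue "$SLURM_JOB_ID"
    fi
    return "$status"
}

# Run the worker only when executing under Slurm, so the helper
# functions can be sourced and checked outside the cluster.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    main
    exit $?
fi
```

The restart cap is a design choice: without it, a task that fails deterministically would be requeued indefinitely.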