SLURM: flag to auto-requeue jobs cancelled due to preemption?

I'm running the following job array on SLURM:


#SBATCH --array=1-1000
#SBATCH --partition=scavenge
#SBATCH --mem=2g
#SBATCH --time=1:00:00

module load Python/3.6.4-iomkl-2018a

Many of my jobs error out with:

slurmstepd: error: *** JOB 63830645 ON p08r06n17 CANCELLED AT 2020-08-18T21:40:52 DUE TO PREEMPTION ***

I'd like to requeue those jobs automatically when they are preempted. Is that possible? Any pointers would be much appreciated!


  • This depends on how your cluster is set up. Preemption behavior is controlled by the PreemptMode option in slurm.conf. If it is set to REQUEUE, a preempted job is requeued provided that either the job was submitted with the sbatch option --requeue, or JobRequeue is set to 1 in the cluster configuration (check the output of scontrol show config).
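
    To see which case applies on a given cluster, the relevant settings can be read from the configuration dump. The following is only a sketch: check_requeue is a hypothetical helper name, and it assumes that scontrol show config prints lines of the form "PreemptMode = REQUEUE". A partition may also override PreemptMode, which this sketch does not check.

    ```shell
    # Reads `scontrol show config` output on stdin and reports whether
    # preempted jobs get requeued. check_requeue is a hypothetical name.
    check_requeue() {
        awk -F' *= *' '
            $1 == "PreemptMode" { mode = $2 }
            $1 == "JobRequeue"  { requeue = $2 }
            END {
                if (mode !~ /REQUEUE/) {
                    print "not requeued on preemption"
                    exit 1
                }
                if (requeue == 1) {
                    print "requeued by default"
                } else {
                    print "requeued only with --requeue"
                }
            }'
    }

    # Usage on a login node (requires Slurm):
    #   scontrol show config | check_requeue
    ```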

    You can add the requeue parameter to your job script as follows:

    #SBATCH --requeue
    #SBATCH --array=1-1000

    Or you can pass the requeue flag when submitting your job:

    sbatch --requeue run.job

    If that is not the case on your cluster, you may still be able to work around it. When a job is terminated (for any reason), Slurm first sends SIGTERM and then, after the KillWait interval (30 seconds by default), SIGKILL. So you can trap SIGTERM and requeue the job manually, e.g.:

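    #!/bin/bash
    #SBATCH --array=1-1000
    #SBATCH --partition=scavenge
    #SBATCH --mem=2g
    #SBATCH --time=1:00:00
    # On SIGTERM (signal 15), requeue this job and exit.
    trap 'scontrol requeue ${SLURM_JOB_ID}; exit 15' 15
    module load Python/3.6.4-iomkl-2018a
    # Run the workload in the background and wait on it, so the shell
    # can run the trap handler as soon as the signal arrives.
    python ${SLURM_ARRAY_TASK_ID} &
    wait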

    This requeues the job as soon as a SIGTERM arrives. Downside: to actually cancel the job you need to send SIGKILL, e.g. scancel --signal=KILL <jobid>, because the default signal sent by scancel is SIGTERM, which the trap answers by requeuing the job.
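
    The trap mechanism can be tried out without Slurm. In the sketch below, echo stands in for scontrol requeue and sleep 30 for the Python workload; trap_demo is a hypothetical name, and the SIGTERM is simulated by the script itself after one second. The shell runs a trap handler only once the current foreground command has finished, which is why the workload runs in the background while the script blocks in wait.

    ```shell
    # Stand-in for the job script; runs without Slurm.
    # trap_demo is a hypothetical name used only for this illustration.
    trap_demo() {
        sh -c '
            # On SIGTERM: report, stop the workload, exit with status 15.
            trap "echo requeue requested; kill \$work; exit 15" TERM
            sleep 30 &                      # workload runs in the background
            work=$!
            ( sleep 1; kill -TERM $$ ) &    # simulate the SIGTERM after 1 s
            wait $work                      # interrupted by the signal, so the trap runs promptly
        '
    }
    ```

    Calling trap_demo prints "requeue requested" after about one second and returns exit status 15, rather than blocking for the full 30 seconds.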