
SLURM: flag to auto-requeue jobs cancelled due to preemption?


I'm running the following job array on SLURM:

#!/bin/bash

#SBATCH --array=1-1000
#SBATCH --partition=scavenge
#SBATCH --mem=2g
#SBATCH --time=1:00:00

module load Python/3.6.4-iomkl-2018a
python run.py ${SLURM_ARRAY_TASK_ID}

Many of my jobs error out with:

slurmstepd: error: *** JOB 63830645 ON p08r06n17 CANCELLED AT 2020-08-18T21:40:52 DUE TO PREEMPTION ***

I'd like those jobs to be requeued automatically when they're preempted. Is it possible to do so? Any pointers would be much appreciated!


Solution

  • This depends on how your cluster is set up. Preemption is handled by the PreemptMode option. If that is set to REQUEUE, preempted jobs are requeued automatically, provided that either the --requeue parameter was given to srun/sbatch, or JobRequeue is set to 1 in the cluster configuration (see the output of scontrol show config).
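
    To see how preemption is configured on your cluster, you can query the controller directly. This is just a quick check; scontrol show config is the command referenced above, and the grep pattern simply filters the relevant settings:

    # Show the cluster-wide preemption, requeue, and kill-wait settings
    scontrol show config | grep -E 'PreemptMode|JobRequeue|KillWait'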

    You can add the --requeue option to your job script as follows:

    #!/bin/bash
    
    #SBATCH --requeue
    #SBATCH --array=1-1000
    ...
    

    Or you can pass the requeue flag when submitting your job:

    sbatch --requeue run.job
    

    If that is not the case on your cluster, you might still be able to work around it: the default KillWait time is 30 seconds, so when your job is terminated (for any reason) there is a 30-second window between the SIGTERM and the SIGKILL signals. You can therefore trap SIGTERM and requeue the job manually yourself, e.g.:

    #!/bin/bash
    
    #SBATCH --array=1-1000
    #SBATCH --partition=scavenge
    #SBATCH --mem=2g
    #SBATCH --time=1:00:00
    
    # On SIGTERM (signal 15), requeue this array task and exit with status 15
    trap 'scontrol requeue ${SLURM_JOB_ID}; exit 15' 15
    module load Python/3.6.4-iomkl-2018a
    # Run python in the background and wait: bash only executes the trap once
    # the foreground command returns, so waiting on a background child lets the
    # trap fire as soon as the signal arrives
    python run.py ${SLURM_ARRAY_TASK_ID} &
    wait
    

    This requeues the job as soon as a SIGTERM arrives. Downside: if you want to cancel the job for good, you will need to use scancel -9 <jobid>, because the default signal sent by scancel is SIGTERM, which this script now treats as a request to requeue.
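
    To verify that a preempted task really was put back in the queue, you can inspect its Requeue and Restarts fields; this is a quick sanity check, assuming your scontrol version reports both fields. 63830645 is the job ID from the error message above, so substitute your own:

    # Print whether requeueing is enabled and how many times the job has been restarted
    scontrol show job 63830645 | grep -oE '(Requeue|Restarts)=[0-9]+'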