
Slurm: how many times will failed jobs be --requeue'd


I have a Slurm job array for which the job file includes a --requeue directive. Here is the full job file:

#!/bin/bash
#SBATCH --job-name=catsss
#SBATCH --output=logs/cats.log
#SBATCH --array=1-10000
#SBATCH --requeue
#SBATCH --partition=scavenge
#SBATCH --mem=32g
#SBATCH --time=24:00:00
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=douglas.duhaime@gmail.com
module load Langs/Python/3.4.3
python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'

Several of the array values have restarted at least once. I would like to know how many times these jobs will restart before the scheduler finally cancels them. Will the requeues carry on indefinitely until a sysadmin manually cancels the jobs, or is there a maximum number of retries?


Solution

  • AFAIK, jobs can be requeued an unlimited number of times; there is no built-in cap on the number of retries. You only decide whether the job is allowed to be requeued at all. With --no-requeue it will never be requeued; with --requeue it will be requeued every time the system decides it is necessary (node failure, preemption by a higher-priority job, ...).

    The jobs keep being restarted until they actually finish (successfully or not), rather than being interrupted. If you want to impose a cap yourself, you can do that from inside the batch script, as in the sketch below.
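
    A minimal sketch of that idea, adapted from the original job file: the MAX_RESTARTS name and its value of 3 are arbitrary choices for illustration, and it assumes your Slurm version exports SLURM_RESTART_COUNT (Slurm sets this variable for jobs that have been requeued; it is unset on the first run).

    #!/bin/bash
    #SBATCH --job-name=catsss
    #SBATCH --output=logs/cats.log
    #SBATCH --array=1-10000
    #SBATCH --requeue
    #SBATCH --partition=scavenge
    #SBATCH --mem=32g
    #SBATCH --time=24:00:00

    # Number of times this task has already been restarted; empty on the first run.
    RESTARTS=${SLURM_RESTART_COUNT:-0}
    MAX_RESTARTS=3   # hypothetical cap, pick whatever suits your workload

    if [ "${RESTARTS}" -ge "${MAX_RESTARTS}" ]; then
        # Exit on our own terms (with a non-zero code) so the task ends up FAILED
        # instead of running the same work yet again after another requeue.
        echo "Task ${SLURM_ARRAY_TASK_ID} already restarted ${RESTARTS} times; giving up." >&2
        exit 1
    fi

    module load Langs/Python/3.4.3
    python3 cats.py ${SLURM_ARRAY_TASK_ID} 'cats'

    While a task is still in the queue or running, you can also check how many times it has been restarted so far with scontrol show job <jobid> (look for the Restarts= field), or review its history afterwards with sacct.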