cluster-computing hpc slurm

SLURM requeue with new JOBID


Is it possible to set some requeue option so that the JOBID changes when Slurm decides to requeue a job (after a node failure, for instance), so that the folder associated with the first JOBID is not overwritten?

Thanks,


Solution

  • A requeued job is still the same job, so the job ID will not change.

    What you can do is prevent requeuing with the --no-requeue option. You will then need to resubmit the job yourself, either by hand or using a workflow manager.

    Another option is to append the restart count to the folder name. For instance, if your submission script has lines such as

    WORKDIR=/some/path/${SLURM_JOB_ID}
    mkdir -p $WORKDIR
    cd $WORKDIR
    

    you can replace it with

    WORKDIR=/some/path/${SLURM_JOB_ID}${SLURM_RESTART_COUNT}
    mkdir -p $WORKDIR
    cd $WORKDIR
    

    On the first run, $SLURM_RESTART_COUNT is unset, which leaves the original behaviour unchanged. After each requeue it is set to 1, 2, and so on, effectively suffixing the job ID with the requeue number.

    For the name of the output file, you can use --open-mode=append to avoid overwriting the output file when the job restarts.
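
Putting the two suggestions together, a submission script might look like the sketch below. It is only an illustration under a few assumptions: the helper name `attempt_dir`, the `_r` separator and the `job-%j.out` file name are hypothetical choices, not part of the original answer. The separator is added because plain concatenation could be ambiguous (job 123 on its first restart would give 1231, the same name as the first run of job 1231); as a side effect, the first run's directory becomes `123_r0` instead of `123`.

```shell
#!/bin/bash
# Allow Slurm to requeue the job, for instance after a node failure.
#SBATCH --requeue
# Append to the output file on restart instead of truncating it.
#SBATCH --open-mode=append
# %j expands to the job ID; the file name itself is a hypothetical choice.
#SBATCH --output=job-%j.out

# Hypothetical helper: print a per-attempt directory under the base path $1.
attempt_dir() {
    # SLURM_RESTART_COUNT is unset on the first run, so fall back to 0.
    printf '%s/%s_r%s\n' "$1" "$SLURM_JOB_ID" "${SLURM_RESTART_COUNT:-0}"
}

# Only act when running under Slurm, so the script is harmless elsewhere.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    WORKDIR=$(attempt_dir /some/path)
    mkdir -p "$WORKDIR"
    cd "$WORKDIR" || exit 1
    # Mark the start of each attempt in the appended output file.
    echo "=== attempt ${SLURM_RESTART_COUNT:-0} of job ${SLURM_JOB_ID} ==="
fi
```

With this layout, each attempt writes into its own directory, while the single output file keeps the log of every attempt, separated by the banner lines.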