Search code examples
slurm

How to prevent slurm from terminating entire job when last process fails or is killed?


I am launching 4 process via job script. These 4 processes launch successfully on the slurm host. But if the last process(4th one) crashed or get killed, the whole job get terminated. That's not the case with first 3 process. If I kill any one of first 3 process, the job stays there until all remaining processes complete their execution. But if I killed the last one, it terminates and kill first three processes as well.

my job_scrpt.sh contains: python process1 & process2 & process3 & process4

how does the job termination(in both successful and failed cases) works in slurm?


Solution

  • With

    process1 &
    process2 &
    process3 &
    process4
    

    the script will terminate as soon as process4 terminates (it is the only one "blocking"); and so will the job.

    You should write

    process1 &
    process2 &
    process3 &
    process4 &
    wait
    

    so as to send all four processes to the background, and the wait command will block the script until all of them terminate, and so will the job.