bash, job-scheduling, exit-code, slurm

Start independent job steps and keep track of highest exit code


I want to start many independent tasks (job steps) as part of one job and want to keep track of the highest exit code of all these tasks.

Inspired by this question I am currently doing something like

#SBATCH stuff....

for i in {1..3}; do
    srun -n 1 ./myprog ${i} >& task${i}.log &
done

wait

in my jobs.sh, which I sbatch, to start my tasks.

How can I define a variable exitcode which, after the wait command, contains the highest exit code of all the tasks?

Thanks so much in advance!


Solution

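  • You can store each job step's PID in an array and then wait for each PID individually. Calling wait with a PID returns that process's exit status, so you can keep the maximum as you go, like this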

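    #SBATCH stuff....
    
    pids=()
    for i in {1..3}; do
        srun -n 1 ./myprog "${i}" >& "task${i}.log" &
        pids+=("$!")    # PID of the srun that was just backgrounded
    done
    
    exitcode=0
    for pid in "${pids[@]}"; do
        wait "$pid"
        rc=$?           # save the status before another command overwrites $?
        if (( rc > exitcode )); then
            exitcode=$rc
        fi
    done
    
    echo "$exitcode"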
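
To try the pattern without a Slurm allocation, here is a minimal sketch that swaps `srun -n 1 ./myprog` for a hypothetical stand-in function, `fake_task`, which simply exits with the code it is given. The function name and the exit codes in `codes` are assumptions made for illustration only; the wait loop is the same as above.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for `srun -n 1 ./myprog <i>`:
# it does no work and exits with the status passed as its argument.
fake_task() {
    return "$1"
}

codes=(0 3 1)    # made-up exit codes, one per simulated task
pids=()
for code in "${codes[@]}"; do
    fake_task "$code" &     # run in the background, like the srun calls
    pids+=("$!")
done

exitcode=0
for pid in "${pids[@]}"; do
    wait "$pid"             # returns the exit status of that one process
    rc=$?
    if (( rc > exitcode )); then
        exitcode=$rc
    fi
done

echo "highest exit code: $exitcode"    # prints "highest exit code: 3"
```

If the batch script should itself fail when any task fails, ending it with `exit "$exitcode"` passes the highest code on as the script's own exit status.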