parallel-processing slurm hpc openfoam

Slurm script for parallel execution of independent tasks not working


I am having a problem with the following Slurm script:

#!/bin/bash
#
#SBATCH --job-name=parReconstructPar        # Job name
#SBATCH --output=log.parReconstructPar      # Standard output and error log
#SBATCH --partition=orbit                   # define the partition
#SBATCH -n 32
#

srun --exclusive -n1 reconstructPar -allRegions -time 0.0:0.3 &
srun --exclusive -n1 reconstructPar -allRegions -time 0.35:0.65 &
srun --exclusive -n1 reconstructPar -allRegions -time 0.7:1.0 &
srun --exclusive -n1 reconstructPar -allRegions -time 1.05:1.35 &
srun --exclusive -n1 reconstructPar -allRegions -time 1.4:1.7 &
srun --exclusive -n1 reconstructPar -allRegions -time 1.75:2.05 &
srun --exclusive -n1 reconstructPar -allRegions -time 2.1:2.4 &
srun --exclusive -n1 reconstructPar -allRegions -time 2.45:2.75 &
srun --exclusive -n1 reconstructPar -allRegions -time 2.8:3.1 &
srun --exclusive -n1 reconstructPar -allRegions -time 3.15:3.4 &
srun --exclusive -n1 reconstructPar -allRegions -time 3.45:3.7 &
srun --exclusive -n1 reconstructPar -allRegions -time 3.75:4.0 &
srun --exclusive -n1 reconstructPar -allRegions -time 4.05:4.3 &
srun --exclusive -n1 reconstructPar -allRegions -time 4.35:4.6 &
srun --exclusive -n1 reconstructPar -allRegions -time 4.65:4.9 &
srun --exclusive -n1 reconstructPar -allRegions -time 4.95:5.2 &
srun --exclusive -n1 reconstructPar -allRegions -time 5.25:5.5 &
srun --exclusive -n1 reconstructPar -allRegions -time 5.55:5.8 &
srun --exclusive -n1 reconstructPar -allRegions -time 5.85:6.1 &
srun --exclusive -n1 reconstructPar -allRegions -time 6.15:6.4 &
srun --exclusive -n1 reconstructPar -allRegions -time 6.45:6.7 &
srun --exclusive -n1 reconstructPar -allRegions -time 6.75:7.0 &
srun --exclusive -n1 reconstructPar -allRegions -time 7.05:7.3 &
srun --exclusive -n1 reconstructPar -allRegions -time 7.35:7.6 &
srun --exclusive -n1 reconstructPar -allRegions -time 7.65:7.9 &
srun --exclusive -n1 reconstructPar -allRegions -time 7.95:8.2 &
srun --exclusive -n1 reconstructPar -allRegions -time 8.25:8.5 &
srun --exclusive -n1 reconstructPar -allRegions -time 8.55:8.8 &
srun --exclusive -n1 reconstructPar -allRegions -time 8.85:9.1 &
srun --exclusive -n1 reconstructPar -allRegions -time 9.15:9.4 &
srun --exclusive -n1 reconstructPar -allRegions -time 9.45:9.7 &
srun --exclusive -n1 reconstructPar -allRegions -time 9.75:10.0 &

The script is supposed to launch several tasks that are independent of each other and should run in parallel. However, when I submit the job to the scheduler, the tasks are not launched and the job is removed immediately. The log file does not show a single entry.

I would appreciate it if someone could tell me what is wrong with this script.

Best regards

I tried running the script without --exclusive and also with explicit memory allocation.


Solution

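  • You are missing the wait command at the end of the submission script. Every srun line ends with &, so it runs in the background and the script reaches its end almost immediately. Without wait to block until all backgrounded processes have completed, the script exits straight away and Slurm ends the job, as you have seen.

    That is, your script should be: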

    #!/bin/bash
    #
    #SBATCH --job-name=parReconstructPar        # Job name
    #SBATCH --output=log.parReconstructPar      # Standard output and error log
    #SBATCH --partition=orbit                   # define the partition
    #SBATCH -n 32
    #
    
    srun --exclusive -n1 reconstructPar -allRegions -time 0.0:0.3 &
    srun --exclusive -n1 reconstructPar -allRegions -time 0.35:0.65 &
    srun --exclusive -n1 reconstructPar -allRegions -time 0.7:1.0 &
    srun --exclusive -n1 reconstructPar -allRegions -time 1.05:1.35 &
    srun --exclusive -n1 reconstructPar -allRegions -time 1.4:1.7 &
    srun --exclusive -n1 reconstructPar -allRegions -time 1.75:2.05 &
    srun --exclusive -n1 reconstructPar -allRegions -time 2.1:2.4 &
    srun --exclusive -n1 reconstructPar -allRegions -time 2.45:2.75 &
    srun --exclusive -n1 reconstructPar -allRegions -time 2.8:3.1 &
    srun --exclusive -n1 reconstructPar -allRegions -time 3.15:3.4 &
    srun --exclusive -n1 reconstructPar -allRegions -time 3.45:3.7 &
    srun --exclusive -n1 reconstructPar -allRegions -time 3.75:4.0 &
    srun --exclusive -n1 reconstructPar -allRegions -time 4.05:4.3 &
    srun --exclusive -n1 reconstructPar -allRegions -time 4.35:4.6 &
    srun --exclusive -n1 reconstructPar -allRegions -time 4.65:4.9 &
    srun --exclusive -n1 reconstructPar -allRegions -time 4.95:5.2 &
    srun --exclusive -n1 reconstructPar -allRegions -time 5.25:5.5 &
    srun --exclusive -n1 reconstructPar -allRegions -time 5.55:5.8 &
    srun --exclusive -n1 reconstructPar -allRegions -time 5.85:6.1 &
    srun --exclusive -n1 reconstructPar -allRegions -time 6.15:6.4 &
    srun --exclusive -n1 reconstructPar -allRegions -time 6.45:6.7 &
    srun --exclusive -n1 reconstructPar -allRegions -time 6.75:7.0 &
    srun --exclusive -n1 reconstructPar -allRegions -time 7.05:7.3 &
    srun --exclusive -n1 reconstructPar -allRegions -time 7.35:7.6 &
    srun --exclusive -n1 reconstructPar -allRegions -time 7.65:7.9 &
    srun --exclusive -n1 reconstructPar -allRegions -time 7.95:8.2 &
    srun --exclusive -n1 reconstructPar -allRegions -time 8.25:8.5 &
    srun --exclusive -n1 reconstructPar -allRegions -time 8.55:8.8 &
    srun --exclusive -n1 reconstructPar -allRegions -time 8.85:9.1 &
    srun --exclusive -n1 reconstructPar -allRegions -time 9.15:9.4 &
    srun --exclusive -n1 reconstructPar -allRegions -time 9.45:9.7 &
    srun --exclusive -n1 reconstructPar -allRegions -time 9.75:10.0 &
    
    
    wait
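
If you also want to know whether any of the steps failed, one option is to remember each background PID and wait on them one by one. The following is only a minimal sketch of that idea, not something taken from your setup: run_step is a hypothetical stand-in for the real `srun --exclusive -n1 reconstructPar -allRegions -time ...` line so that the snippet runs anywhere, and only the first three time ranges from your script are used.

```shell
#!/bin/bash
# Sketch: launch steps in the background, then wait for each one and count failures.
# run_step is a hypothetical stand-in (an assumption) for:
#   srun --exclusive -n1 reconstructPar -allRegions -time "$1"
run_step() {
    sleep 1                      # pretend to do some work
    echo "reconstructed $1"
}

pids=""
for r in 0.0:0.3 0.35:0.65 0.7:1.0; do
    run_step "$r" &              # backgrounded, like the srun lines above
    pids="$pids $!"              # $! is the PID of the most recent background job
done

launched=0
failed=0
for pid in $pids; do
    launched=$((launched + 1))
    # "wait PID" blocks until that job exits and returns its exit status
    wait "$pid" || failed=$((failed + 1))
done

echo "steps launched: $launched, failed: $failed"
```

A plain wait with no arguments, as in the script above, is enough to keep the job alive until every step has finished. Waiting per PID only adds the exit status of each step, which you could use to make the batch script itself exit with a non-zero code.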