Tags: bash, slurm, sbatch

Do I need a single bash file for each task in SLURM?


I am trying to launch several tasks on a SLURM-managed cluster, and would like to avoid dealing with dozens of files. Right now, I have 50 tasks (indexed by i; for simplicity, i is also the input parameter of my program), and for each one a separate bash file slurm_run_i.sh which specifies the computation configuration and the srun command:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1 
#SBATCH -J pltCV
#SBATCH --mem=30G

srun python plotConvergence.py i

I am then using another bash file to submit all these tasks, slurm_run_all.sh

#!/bin/bash
for i in {1..50}; do
  sbatch slurm_run_$i.sh
done

This works (50 jobs are running on the cluster), but I find it troublesome to have more than 50 input files. Searching for a solution, I came up with the & (background) operator, obtaining something like:

#!/bin/bash

#SBATCH --ntasks=50
#SBATCH --cpus-per-task=1 
#SBATCH -J pltall
#SBATCH --mem=30G

# Running jobs 
srun python plotConvergence.py 1   &
srun python plotConvergence.py 2   & 
...
srun python plotConvergence.py 49  & 
srun python plotConvergence.py 50  & 
wait
echo "All done"

Which seems to run as well. However, I cannot manage each of these jobs independently: the output of squeue shows a single job (pltall) running on a single node. As there are only 12 cores per node in the partition I am working in, I assume most of my tasks are waiting on the single node I have been allocated. Setting the -N option doesn't change anything either. Moreover, I can no longer cancel individual jobs if I realize there's a mistake, which sounds problematic to me.

Is my interpretation right, and is there a better way than my attempt to run several jobs in Slurm without getting lost among many files?


Solution

  • What you are looking for is the job array feature of Slurm.

    In your case, you would have a single submission file (slurm_run.sh) like this:

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1 
    #SBATCH -J pltCV
    #SBATCH --mem=30G
    #SBATCH --array=1-50
    
    srun python plotConvergence.py ${SLURM_ARRAY_TASK_ID}
    

    and then submit the array of jobs with

    sbatch slurm_run.sh
    
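    If your real input parameters are not simply the integers 1 to 50, one option (a minimal sketch, assuming a hypothetical params.txt file with one parameter value per line) is to let the array index select the corresponding line, replacing the srun line above with:

    # pick the line of params.txt matching this task's index (params.txt is assumed)
    PARAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
    srun python plotConvergence.py "$PARAM"
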

    You will see that 50 jobs have been submitted. You can cancel all of them at once or one by one, as shown below. See the man page of sbatch for details.
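
    For example (a sketch assuming the array was assigned job ID 123456), squeue lists each array task separately, and scancel accepts either the whole array or a single task:

    # array tasks show up as 123456_1, 123456_2, ... (pending ones grouped as 123456_[k-50])
    squeue -u $USER

    # cancel the entire array
    scancel 123456

    # cancel only task 7 of the array
    scancel 123456_7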