
Preferred approach for running one script over multiple directories in SLURM


My most typical use case is running a single script (usually R or Matlab) over multiple directories. I have access to a high-performance computing environment (SLURM-based). From my research so far, it is unclear to me which of the following two approaches makes the most efficient use of the available CPUs/cores. I also want to make sure I'm not unnecessarily tying up system resources, so I'd like to double-check which approach is most suitable.

Approach 1:

  1. Parallelize code within the script (MPI).
  2. Wrap this in a loop that applies the script to all directories.
  3. Submit this as a single MPI job via one SLURM script (sketched below).
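
If I understand the mechanics correctly, the submission script for Approach 1 would look roughly like this (myscript_mpi.R, the directory pattern, and the resource values are just placeholders on my part):

    #!/bin/bash
    #SBATCH --ntasks 8        # placeholder: number of MPI ranks
    #SBATCH --time 1-0

    module load R
    for d in data*/ ; do
        srun Rscript myscript_mpi.R "$d"   # MPI-parallel script, one directory at a time
    done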

Approach 2:

  1. Parallelize code within the script (MPI).
  2. Create an MPI job array with one job per directory, each running the script on its directory (sketched below).
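
And Approach 2, as I picture it, would be a job array in which each array task handles one directory (again with placeholder names and values):

    #!/bin/bash
    #SBATCH --ntasks 8         # placeholder: MPI ranks per array task
    #SBATCH --time 1-0
    #SBATCH --array=1-10       # one array task per data directory

    module load R
    DIRS=(data*/)
    srun Rscript myscript_mpi.R "${DIRS[$SLURM_ARRAY_TASK_ID-1]}"   # Bash arrays are zero-indexed, hence the -1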

I'm still new to this, so if I've mixed something up here or you need more details to answer the question, please let me know.


Solution

  • If you do not explicitly use MPI inside your original R or Matlab script, I suggest you avoid using MPI at all and use job arrays.

    Assuming you have a script myscript.R and a set of subdirectories data01, data02, ..., data10, and the script takes the name of the directory as an input parameter, you can do the following.

    Create a submission script in the parent directory of the data directories:

    #!/bin/bash
    #SBATCH --ntasks 1
    #SBATCH --cpus-per-task 1
    #SBATCH --mem-per-cpu=2G
    #SBATCH --time 1-0
    #SBATCH --array=1-10

    DIRS=(data*/) # Create a Bash array with all data directories

    module load R
    Rscript myscript.R ${DIRS[$SLURM_ARRAY_TASK_ID-1]} # Feed the script with the data directory
                                                       # corresponding to the task ID in the array
                                                       # (minus 1 because Bash arrays are zero-indexed
                                                       # while the array task IDs start at 1)


    This script will create a job array where each job runs myscript.R with one of the data directories as its argument.
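
    For reference, if you save it as, say, job_array.sh (the file name is arbitrary), submitting and monitoring it would look roughly like this:

    sbatch job_array.sh     # submits the whole array with a single command
    squeue -u $USER         # each array task shows up as <jobid>_<taskid>
    sacct -j <jobid> -b     # brief state/exit-code summary once the tasks finish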

    Of course you will need to adapt the memory and time values, investigate whether using more than one CPU per job is beneficial in your case, and adapt the --array parameter to the actual number of directories.
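
    For example, if your R code can use several cores on one node (say through the parallel package), a possible adaptation, with the value 4 chosen arbitrarily, is:

    #SBATCH --cpus-per-task 4 # reserve 4 cores for each array task

    # Slurm exports SLURM_CPUS_PER_TASK to the job environment, so the script
    # can read it to size its worker pool instead of hard-coding a core count.

    In R that would typically be something along the lines of parallel::mclapply(..., mc.cores = as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))), but the details depend on how your script is parallelized.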