My most typical use case is running a single script (usually R or Matlab) over multiple directories. I have access to a SLURM-based high-performance computing environment. From my research so far, it is unclear to me which of the following approaches makes the most efficient use of the available CPUs/cores. I also want to make sure I'm not unnecessarily taking up system resources, so I'd like to double-check which of the two approaches is most suitable.
Approach 1:
Approach 2:
I'm still new to this, so if I've mixed something up here or you need more details to answer the question, please let me know.
If you do not explicitly use MPI inside your original R or Matlab script, I suggest you avoid using MPI at all and use job arrays.
Assuming you have a script myscript.R, a set of subdirectories data01, data02, ..., data10, and that the script takes the name of a directory as its input parameter, you can do the following.
Create a submission script in the parent directory of the data directories:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=1-0
#SBATCH --array=1-10

DIRS=(data*/)                          # Bash array with all data directories
DIR=${DIRS[$SLURM_ARRAY_TASK_ID - 1]}  # Directory corresponding to the task ID
                                       # (task IDs start at 1, Bash arrays at 0)
module load R
Rscript myscript.R "$DIR"              # Feed the script with that data directory
This script creates a job array in which each job runs myscript.R with one of the data directories as its argument.
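Assuming you save the submission script above as submit.sh (the file name is just an example), you submit the whole array with a single sbatch call and can monitor it with the usual tools:

sbatch submit.sh                                 # submits all array tasks at once
squeue -u $USER                                  # lists your pending/running tasks
sacct -j <jobid> -o JobID,State,Elapsed,MaxRSS   # after completion: per-task runtime and memory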
Of course, you will need to adapt the memory and time values, investigate whether using more than one CPU per job is beneficial in your case, and adjust the --array parameter to match the actual number of directories.
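If the number of directories changes between runs, one option (a sketch, still assuming the data*/ naming and a script called submit.sh) is to set --array on the command line, which overrides the value given in the #SBATCH line:

NDIRS=$(ls -d data*/ | wc -l)        # count the data directories
sbatch --array=1-$NDIRS submit.sh    # command-line option takes precedence over the script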