I'm trying to use Slurm to run several instances of a Python script simultaneously, where each instance internally parallelizes its work with multiprocessing. The script solves a stochastic differential equation many times in order to gather enough statistics for the analysis; to speed this up, I split the repetitions across 4 child processes with multiprocessing. To make use of all my resources, I also want to run the script several times at once with different input parameters (i.e. different conditions for the equation). Right now I'm using one node with 16 CPUs, so I want to run the code 4 times simultaneously, assigning 4 CPUs to each set of conditions, such that 4x4=16.
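For context, inside script.py the repetitions are split across child processes with multiprocessing. A minimal sketch of that pattern (solve_sde, N_REPETITIONS and the parameter handling here are placeholders for illustration, not my actual code) looks like this:

import sys
from multiprocessing import Pool

N_REPETITIONS = 1000  # total number of SDE realizations (illustrative value)
N_WORKERS = 4         # matches the 4 CPUs I plan to give each parameter set

def solve_sde(seed, params):
    # Placeholder: integrate one realization of the SDE for the given parameters
    ...

def worker(seed):
    # Each child process solves one realization, using the command-line parameters
    params = [float(x) for x in sys.argv[1:]]
    return solve_sde(seed, params)

if __name__ == "__main__":
    with Pool(processes=N_WORKERS) as pool:
        results = pool.map(worker, range(N_REPETITIONS))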
My implementation is a Slurm batch script, run.sh, that looks like this:
#!/bin/bash
#SBATCH -N 1              # one node
#SBATCH -n 16             # 16 tasks (CPUs) in total
#SBATCH --time=10:00:00

module purge > /dev/null 2>&1
eval "$(conda shell.bash hook)"
conda activate my_environment

# One job step per parameter combination, launched in the background
srun --exclusive -n 1 python script.py 0.95 1 1 0 1 0 > nohup1.out 2>&1 &
srun --exclusive -n 1 python script.py 0.955 1 1 0 1 0 > nohup2.out 2>&1 &
srun --exclusive -n 1 python script.py 0.96 1 1 0 1 0 > nohup3.out 2>&1 &
srun --exclusive -n 1 python script.py 0.965 1 1 0 1 0 > nohup4.out 2>&1 &
srun --exclusive -n 1 python script.py 0.97 1 1 0 1 0 > nohup5.out 2>&1 &
srun --exclusive -n 1 python script.py 0.975 1 1 0 1 0 > nohup6.out 2>&1 &

# Wait for all job steps to finish before the job ends
wait
I run this batch script with sbatch run.sh. Note that in this example I have 6 parameter combinations (I plan to have even more in the real case). I expected Slurm to start the first 4 based on the available resources and launch the remaining 2 as the others finish. However, it starts all 6 scripts at the same time. I suspect this is because of the --exclusive flag; however, when I remove it, the scripts just run one by one.
Thanks!
I solved it like this; it might be helpful for other people.
#!/bin/bash

module purge > /dev/null 2>&1
eval "$(conda shell.bash hook)"
conda activate <my_environment>

# Fixed parameters shared by all runs
A=1.0
B=1.0
C=0.0
D=0.0
E=1.0

# Sweep over F, submitting one job per value in the background
for F in 0.95 0.955 0.96 0.965 0.97 0.975
do
    out_name=A$A-B$B-C$C-D$D-E$E-F$F
    JOBNAME=<custom_job_name>
    srun -n 1 -c 4 --time=100:00:00 --job-name=$JOBNAME python scripts/msd_fle_two_baths_parallel.py $F $A $B $C $D $E > nohup$out_name.out 2>&1 &
done
Running the script as ./run.sh, Slurm allocates the jobs based on the available resources. The flag -n 1 assigns a single task to each job, and -c 4 assigns 4 CPUs to that task, which is what allows Python's multiprocessing. I parameterized all the inputs from A to E to keep the script general, and put the srun line inside a for loop sweeping over F to avoid redundancy.
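As a side note, instead of hard-coding 4 workers inside the Python script, one option (a sketch, not what I originally did) is to read the CPU count that Slurm grants via -c from the SLURM_CPUS_PER_TASK environment variable:

import os
from multiprocessing import Pool

# Slurm sets SLURM_CPUS_PER_TASK when -c/--cpus-per-task is given; fall back to 4 otherwise
n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", 4))

with Pool(processes=n_workers) as pool:
    ...  # same pool.map over the repetitions as before

This way the script automatically follows whatever -c value is passed to srun.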