I have code that currently uses joblib's multiprocessing to distribute the iterations of a loop over multiple cores within a single cluster node, like so:
from joblib import Parallel, delayed

def function():
    return Parallel(n_jobs=28)(delayed(computation)() for _ in range(n_iter))

save(function())  # save results to some file
...
I am limited to 28 cores per node on the cluster, but I'd like to run the above task 4 times. Running all four on a single node would overload the multiprocessing workers and hinder performance. I've looked into mpi4py, which uses the MPI standard to run code concurrently on multiple nodes. From what I've learned, I want my code, script.py, to follow this pattern:
from mpi4py import MPI

comm = MPI.COMM_WORLD
my_rank = comm.Get_rank()  # rank assigned to this process (here, one process per node)
# below, num_proc would be 5: node 0 receives the data
# and nodes 1-4 do the work
num_proc = comm.Get_size()

if my_rank != 0:
    comm.send(function(), dest=0)
else:
    for proc_id in range(1, num_proc):
        result = comm.recv(source=proc_id)
        save(result)
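To make the intent concrete, here is a minimal sketch of how I picture the two layers sitting together in script.py (with computation, save, and n_iter defined or imported elsewhere, as in the snippets above):

from joblib import Parallel, delayed
from mpi4py import MPI

def function():
    # joblib layer: spread the n_iter computations over the 28 cores of this node
    return Parallel(n_jobs=28)(delayed(computation)() for _ in range(n_iter))

comm = MPI.COMM_WORLD
my_rank = comm.Get_rank()
num_proc = comm.Get_size()  # 5 if one MPI process is launched per node on 5 nodes

if my_rank != 0:
    # MPI layer: each worker node runs the joblib loop and ships its results to rank 0
    comm.send(function(), dest=0)
else:
    # rank 0 collects and saves one result per worker node
    for proc_id in range(1, num_proc):
        save(comm.recv(source=proc_id))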
The script to submit a job with MPI using TORQUE has the general pattern:
#PBS -l nodes=5:ppn=28  # ppn is processors per node, i.e. the number of cores per node
...
mpirun -np $np python script.py
But from what I've seen [1], $np here should be nodes * ppn. However, in the loop in script.py above, I want num_proc to correspond to the number of nodes, and n_jobs in function's Parallel call to correspond to ppn.
How do I tell MPI to 'decouple' nodes and cores? If I submit mpirun -np 5 python script.py, is joblib distributing my computation across 28 cores per node, and mpi4py distributing this function to 4 nodes? I know there are various binding options with mpirun, but based on the docs I am not sure which one is appropriate.
References
[1] see 'submitting a job' in this cluster user guide
If I understood correctly, you want to launch one MPI process per node on 5 nodes, and each process should be able to use all 28 cores with Python multiprocessing.
Using this mapping should be sufficient:
mpirun -np $np --map-by ppr:1:node python script.py  # np is 5
MPI will launch one process per node. Inside the node, Python can then use all the available cores.
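As a quick sanity check, each rank can report where it landed and how many cores it is allowed to use; with the mapping above you would expect 5 distinct hostnames, one rank per node (a minimal sketch, Linux-only because of sched_getaffinity):

import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
# the hostname identifies the node; the affinity mask shows how many cores
# this process (and hence joblib's worker processes) may actually use
print("rank %d of %d on %s, %d usable cores"
      % (comm.Get_rank(), comm.Get_size(),
         MPI.Get_processor_name(), len(os.sched_getaffinity(0))))

If the usable-core count comes out smaller than 28, the launcher is binding each process to a subset of the node; depending on the Open MPI defaults on your cluster, adding a binding option such as --bind-to none may be needed so joblib can spread over the whole node.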