
How to decouple nodes and cores with MPI?


I currently have code that uses joblib's multiprocessing backend to distribute a loop's computations over multiple cores within a single cluster node, like so:

from joblib import Parallel, delayed

def function():
    # computation and n_iter are defined elsewhere
    return Parallel(n_jobs=28)(delayed(computation)() for _ in range(n_iter))

save(function())  # save results to some file
...

I am limited to 28 cores per node on the cluster, but I'd like to run the above task 4 times. Packing all four runs onto one node would oversubscribe the cores and hurt performance, so I've looked into mpi4py, which uses the MPI standard to run code concurrently across multiple nodes.

From what I've learned, I want my code, script.py, to follow this pattern:

from mpi4py import MPI
comm = MPI.COMM_WORLD
my_rank = comm.Get_rank()  # rank of this MPI process (here, one process per node)
# below, num_proc would be 5: rank 0 collects the results,
# and ranks 1-4 do the work
num_proc = comm.Get_size()

if my_rank != 0:
    comm.send(function(), dest=0)
else:
    for proc_id in range(1, num_proc):
        result = comm.recv(source=proc_id)
        save(result)
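
As an aside, I believe the same exchange can be written with a collective; a minimal sketch, assuming mpi4py's lowercase gather (which pickles arbitrary Python objects back to the root rank):

# equivalent collective version: every rank contributes a value,
# and rank 0 receives them all as a list
results = comm.gather(function() if my_rank != 0 else None, root=0)
if my_rank == 0:
    for result in results[1:]:  # results[0] is rank 0's None placeholder
        save(result)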

The script to submit a job with MPI using TORQUE has the general pattern:

#PBS -l nodes=5:ppn=28 # ppn is processors (cores) per node
...
mpirun -np $np python script.py

But from what I've seen [1], $np here should be nodes * ppn. However, in the loop in script.py above, I want num_proc to correspond to the number of nodes, and n_jobs in function's Parallel call to correspond to ppn.
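
As an aside, rather than hard-coding $np, I understand TORQUE writes one line per allocated core slot to $PBS_NODEFILE, so the submission script could derive the node count itself; a sketch, assuming that layout:

# one hostname per core slot; distinct hostnames = number of nodes
np=$(sort -u "$PBS_NODEFILE" | wc -l)   # 5 for nodes=5:ppn=28
mpirun -np "$np" python script.py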

How do I tell MPI to 'decouple' nodes and cores? If I submit mpirun -np 5 python script.py, will joblib distribute my computation across 28 cores on each node, while mpi4py distributes the function to the 4 worker nodes?

I know mpirun has various binding options, but based on the docs I am not sure which one is appropriate.
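
To see what a given mpirun invocation actually does, here is a minimal sanity-check sketch (the filename check_mapping.py is just illustrative; it assumes mpi4py is installed):

# check_mapping.py - print where each rank lands and how many cores it sees
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d of %d on host %s, %d cores visible"
      % (comm.Get_rank(), comm.Get_size(),
         MPI.Get_processor_name(), os.cpu_count()))

Launching this the same way as script.py should show whether the five ranks land on five distinct hosts.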

References
[1] See 'submitting a job' in this cluster user guide.


Solution

  • If what I understood is correct, you need to launch one MPI process per node across the 5 nodes, and each process should be able to use all 28 cores with Python multiprocessing.

    Using this mapping should be sufficient:

    mpirun -np $np --map-by ppr:1:node python script.py # $np is 5
    

    MPI will launch one process per node. Inside each node, Python can then use all the available cores.
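
    One caveat, an assumption on my part rather than part of the answer above: some Open MPI versions bind each launched process to a core or socket by default, which would also confine the multiprocessing workers a rank spawns. If the Python workers appear pinned to a few cores, disabling binding is worth trying:

    mpirun -np 5 --map-by ppr:1:node --bind-to none python script.py

    With --bind-to none, each rank's child processes are free to use every core of their node.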