
Slowdown when subprocesses are launched in parallel using mpi4py


Using mpi4py, I'm running a Python program which launches multiple Fortran processes in parallel, started from a SLURM script using (for example):

mpirun -n 4 python myprog.py

but I have noticed that myprog.py takes longer to run as the number of tasks requested increases, e.g. running myprog.py (the following code shows only the MPI part of the program):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = None

if rank == 0:
    # params is built earlier in the program: one row of 4 float64 values per rank
    data = params

# each rank receives its own 4-element slice of params
recvbuf = np.empty(4, dtype=np.float64)
comm.Scatter(data, recvbuf, root=0)

py_task(int(recvbuf[0]), recvbuf[1], recvbuf[2], int(recvbuf[3]))
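
For the buffer-based Scatter to work, the send buffer on rank 0 needs to be a contiguous NumPy array holding one 4-element row per rank. The actual params isn't shown in the question; a hypothetical example of its layout (placeholder values only):

params = np.array([
    [0, 1.5, 0.01, 100],
    [1, 1.5, 0.02, 100],
    [2, 2.0, 0.01, 100],
    [3, 2.0, 0.02, 100],
], dtype=np.float64)   # shape (size, 4) when run with mpirun -n 4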

Running mpirun -n 1 ... on a single recvbuf array takes about 3 min, whilst running on four recvbuf arrays (expectedly in parallel) on four processors with mpirun -n 4 ... takes about 5 min. However, I would expect the run times to be approximately equal for the single and four processor cases.

py_task is effectively a Python wrapper that launches a Fortran program using:

subprocess.check_call(cmd) 
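
The wrapper itself isn't shown here, but a rough sketch of what it might look like (the executable name, argument order and input-file handling are assumptions for illustration, not the original code):

import subprocess

def py_task(task_id, param_a, param_b, n_steps):
    # build the command line for the (hypothetical) Fortran executable
    cmd = ["./fortran_solver",
           f"--input=input_{task_id}.dat",
           str(param_a), str(param_b), str(n_steps)]
    # blocks until the Fortran run finishes; raises CalledProcessError on failure
    subprocess.check_call(cmd)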

There seems to be some interaction between subprocess.check_call(cmd) and the mpi4py package that is stopping the code from properly operating in parallel.

I've looked into this issue but can't seem to find anything that helps. Are there any fixes for this issue, detailed descriptions explaining what's going on here, or recommendations on how to isolate the cause of the bottleneck in this code?

Additional note:

This pipeline was adapted to mpi4py from joblib's Parallel, where there were no issues with subprocess.check_call() running in parallel, which is why I suspect this issue is linked to the interaction between subprocess and mpi4py.


Solution

  • The slowdown was initially fixed by adding in:

    export SLURM_CPU_BIND=none

    to the SLURM script that was launching the jobs (a sketch of such a script is shown below).
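
    For context, a minimal sketch of such a submission script (the job name, task count and walltime are placeholders, not taken from the original setup):

    #!/bin/bash
    #SBATCH --job-name=myprog
    #SBATCH --ntasks=4
    #SBATCH --time=01:00:00

    # disable SLURM's CPU binding so that the Fortran subprocesses are not
    # pinned to the single core assigned to each MPI task
    export SLURM_CPU_BIND=none

    mpirun -n 4 python myprog.py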

    Whilst the above did provide a temporary fix, the issue was actually much deeper and I will provide a very basic description of it here.

    1) I uninstalled the mpi4py I had installed with conda, then reinstalled it with Intel MPI loaded (the recommended MPI version for our computing cluster). In the SLURM script, I then changed the launch of the Python program to:

    srun python my_prog.py

    and removed the export SLURM_CPU_BIND=none line above, and the slowdown disappeared.
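
    The exact commands are cluster specific, but the rebuild followed roughly this pattern (the module name and install flags are assumptions for illustration, not a verbatim record):

    # remove the conda-provided mpi4py, which bundles its own MPI
    conda remove mpi4py

    # load the cluster's recommended MPI, then build mpi4py against it from source
    module load intel-mpi
    pip install --no-binary=mpi4py mpi4py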

    2) Another slowdown was found for launching > 40 tasks at once. This was due to:

    Each time the Fortran-based subprocess is launched, there is a cost to the filesystem for the initial resource requests (e.g. supplying a file as an argument to the program). In my case a large number of tasks were being launched simultaneously and each file could be ~500 MB, which probably exceeded the IO capabilities of the cluster filesystem. The resulting slowdown in launching each subprocess introduced a large overhead to the program.

    The previous joblib implementation of the parallelisation only used a maximum of 24 cores at a time, so there was no significant bottleneck in the requests to the filesystem, which is why no performance issue had been found before.

    For 2), I found the best solution was to significantly refactor my code to minimise the number of subprocesses launched. A very simple fix, but one I hadn't been aware of before finding out about the bottlenecks in resource requests on filesystems.
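
    The details depend on what the Fortran program can accept, but the idea was along these lines: rather than spawning one subprocess (and one set of filesystem requests) per parameter set, give each MPI rank a batch of parameter sets and launch far fewer Fortran runs. A hypothetical sketch, assuming the solver can read a whole batch from one input file:

    import subprocess
    import numpy as np

    def run_batch(rank, param_rows):
        # write all of this rank's parameter sets to a single input file
        batch_file = f"batch_{rank}.dat"          # assumed input format
        np.savetxt(batch_file, param_rows)
        # one subprocess per rank instead of one per parameter set
        subprocess.check_call(["./fortran_solver", f"--input={batch_file}"])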

    (Finally, I'll also add that using the subprocess module within mpi4py is generally not recommended online, with the multiprocessing module preferred for single-node usage.)
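
    For completeness, a minimal single-node alternative along those lines, assuming the params array and py_task wrapper from above (the pool size is illustrative):

    from multiprocessing import Pool

    def run_one(row):
        task_id, a, b, n = row
        py_task(int(task_id), a, b, int(n))    # same wrapper as before

    if __name__ == "__main__":
        with Pool(processes=4) as pool:        # one worker per parameter set
            pool.map(run_one, params.tolist())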