python, openmpi, mpi4py

MPI.Gather call hangs for large-ish arrays


I use mpi4py to parallelize my Python application. I noticed that MPI.Gather deadlocks whenever I increase the number of processes or the size of the gathered arrays beyond a certain point.

Example:

from mpi4py import MPI

import numpy as np

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()
SIZE = COMM.Get_size()


def test():
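    # each rank builds a (100, 400, 15) int64 array (~4.8 MB) filled with its own rank id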
    arr = RANK * np.ones((100, 400, 15), dtype='int64')

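    # only the root rank needs a receive buffer big enough to hold every rank's array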
    recvbuf = None
    if RANK == 0:
        recvbuf = np.empty((SIZE,) + arr.shape, dtype=arr.dtype)

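    # send buffer given as an explicit (buffer, count, datatype) spec;
    # passing arr directly would let mpi4py infer count and datatype from the array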
    print("%s gathering" % RANK)
    COMM.Gather([arr, arr.size, MPI.LONG], recvbuf, root=0)
    print("%s done" % RANK)

    if RANK == 0:
        for i in range(SIZE):
            assert np.all(recvbuf[i] == i)


if __name__ == '__main__':
    test()

Executing this gives:

$ mpirun -n 4 python bug.py 
1 gathering
2 gathering
3 gathering
0 gathering
1 done
2 done

while processes 0 and 3 hang indefinitely. However, if I change the array dimensions to (10, 400, 15), or run the script with -n 2, everything works as expected.

Am I missing something? Is this a bug in OpenMPI or mpi4py?

Platform:

  • OSX Mojave
  • OpenMPI 4.0.0 (via Homebrew)
  • mpi4py 3.0.1
  • Python 3.7
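
For reference, here is a quick way to confirm which MPI library and version mpi4py is actually linked against, using mpi4py's MPI.Get_library_version and MPI.Get_version (wrappers around the standard MPI calls):

from mpi4py import MPI

# prints the full library banner, e.g. "Open MPI v4.0.0, ..." or "MPICH Version: 3.3 ..."
print(MPI.Get_library_version())
print(MPI.Get_version())  # MPI standard version as a tuple, e.g. (3, 1)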

Solution

  • I just noticed that everything works fine with MPICH via Homebrew. So, in case anyone runs into a similar situation on OSX, a workaround is to switch to MPICH:

    $ brew unlink open-mpi
    $ brew install mpich
    $ pip uninstall mpi4py
    $ pip install mpi4py --no-cache-dir
    

    Then, I had to edit /etc/hosts and add the line

    127.0.0.1     <mycomputername>
    

    so that MPICH could resolve the local hostname and run correctly.

    Update:

    By now, this issue should be fixed. The bug was reported and updating OpenMPI to 4.0.1 fixed it for me.
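
    To go back to OpenMPI once the fixed release is installed, roughly the reverse of the workaround above should do (the exact brew link/unlink steps may vary depending on what is currently linked):

    $ brew unlink mpich
    $ brew upgrade open-mpi          # should pick up 4.0.1 or later
    $ brew link open-mpi
    $ pip uninstall mpi4py
    $ pip install mpi4py --no-cache-dir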