Tags: python, python-3.x, subprocess, openmpi, mpi4py

How to fix pickle.UnpicklingError caused by calls to subprocess.Popen in a parallel script that uses mpi4py


Repeated serial calls to subprocess.Popen() in a script parallelized with mpi4py eventually cause what appears to be data corruption during communication, manifesting as a pickle.UnpicklingError of various kinds (I have seen EOF, invalid unicode character, invalid load key, and unpickling stack underflow errors). It seems to happen only when the data being communicated is large, the number of serial calls to subprocess is large, or the number of MPI processes is large.

I can reproduce the error with python>=2.7, mpi4py>=3.0.1, and openmpi>=3.0.0. I would ultimately like to communicate Python objects, so I am using the lowercase mpi4py methods. Here is a minimal example that reproduces the error:

#!/usr/bin/env python
from mpi4py import MPI
from copy import deepcopy
import subprocess

nr_calcs           = 4
tasks_per_calc     = 44
data_size          = 55000

# --------------------------------------------------------------------
def run_test(nr_calcs, tasks_per_calc, data_size):

    # Init MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    comm_size = comm.Get_size()

    # Run Moc Calcs
    icalc = 0
    while True:
        if icalc > nr_calcs - 1: break
        index = icalc
        icalc += 1

        # Init Moc Tasks
        task_list = []
        moc_task = data_size*"x"
        if rank==0:
            task_list = [deepcopy(moc_task) for i in range(tasks_per_calc)]
        task_list = comm.bcast(task_list)

        # Moc Run Tasks
        itmp = rank
        while True:
            if itmp > len(task_list)-1: break
            itmp += comm_size
            proc = subprocess.Popen(["echo", "TEST CALL TO SUBPROCESS"],
                    stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
            out,err = proc.communicate()

        print("Rank {:3d} Finished Calc {:3d}".format(rank, index))

# --------------------------------------------------------------------
if __name__ == '__main__':
    run_test(nr_calcs, tasks_per_calc, data_size)

Running this on one 44-core node with 44 MPI processes completes the first three "calculations" successfully, but on the final loop some processes raise:

Traceback (most recent call last):
  File "./run_test.py", line 54, in <module>
    run_test(nr_calcs, tasks_per_calc, data_size)
  File "./run_test.py", line 39, in run_test
    task_list = comm.bcast(task_list)
  File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
  File "mpi4py/MPI/msgpickle.pxi", line 639, in mpi4py.MPI.PyMPI_bcast
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
_pickle.UnpicklingError

Sometimes the UnpicklingError carries a message, such as invalid load key "x", EOF error, invalid unicode character, or unpickling stack underflow.

Edit: The problem appears to disappear with openmpi<3.0.0, or when using mvapich2, but it would still be good to understand what is going on.


Solution

  • I had the same problem. In my case, I got my code to work by installing mpi4py in a Python virtual environment and by setting mpi4py.rc.recv_mprobe = False, as suggested by Intel: https://software.intel.com/en-us/articles/python-mpi4py-on-intel-true-scale-and-omni-path-clusters (a sketch of this workaround appears below).

    However, in the end I just switched to using the capital-letter methods Recv and Send with NumPy arrays (see the second sketch below). They work fine with subprocess and don't need any additional tricks.
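
A minimal sketch of the first workaround, assuming (as the Intel article describes) that the rc flag has to take effect before mpi4py's MPI submodule is first imported; the broadcast payload is just an illustrative value:

import mpi4py
mpi4py.rc.recv_mprobe = False  # disable matched probes; set before importing MPI

from mpi4py import MPI

comm = MPI.COMM_WORLD
# the lowercase, pickle-based calls are then used exactly as in the question
data = comm.bcast("x" * 55000 if comm.Get_rank() == 0 else None)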
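
And a sketch of the buffer-based alternative from the last paragraph; the uint8 dtype, buffer size, and tag value are illustrative choices rather than anything from the original post:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.empty(55000, dtype=np.uint8)      # pre-allocated send/receive buffer
if rank == 0:
    buf[:] = ord("x")
    for dest in range(1, comm.Get_size()):
        comm.Send([buf, MPI.BYTE], dest=dest, tag=7)   # uppercase Send: no pickling
else:
    comm.Recv([buf, MPI.BYTE], source=0, tag=7)        # uppercase Recv fills the buffer

For the broadcast pattern in the question, comm.Bcast(buf, root=0) is the corresponding buffer-based collective.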