Tags: python, windows, mpi4py, ms-mpi

MPI4Py comm.Barrier() not blocking on MSMPI?


While implementing a parallel algorithm in Python 3.7.0 using mpi4py 3.0.0 with MS-MPI on Windows 10, I was having problems with Gatherv not gathering everything. When checking the printed output of various bits, it seemed to be executing things in the wrong order.

I wrote a bit of code that duplicates the problem:

from mpi4py import MPI
from time import sleep
import random

comm = MPI.COMM_WORLD
rank = comm.Get_rank()


if rank == 0:
    sleep(2)
    print("head finished sleeping")

comm.Barrier()

sleep(random.uniform(0, 2))
print(rank, 'finished sleeping ')

comm.Barrier()

if rank == 0:
    print("All done!")

If I understand comm.Barrier() correctly, this should produce

head finished sleeping
2 finished sleeping
0 finished sleeping
3 finished sleeping
1 finished sleeping
4 finished sleeping
All done!

with the middle bits in some order, right? But when I actually run mpiexec -n 5 python .\blocking_test.py, I get the following:

2 finished sleeping
1 finished sleeping
3 finished sleeping
head finished sleeping
0 finished sleeping
All done!
4 finished sleeping

Do I misunderstand the usage of comm.Barrier(), or is there something wrong with my environment?


Solution

  • The reason they appear to be printed in the wrong order is the MPI back-end that collects the messages. The standard output streams of the child processes are not connected directly to the terminal window, because that is impossible when the processes may be running across multiple computers.

    Instead, the MPI back-end collects the messages from each process and then uses standard MPI calls to deliver them to the back-end of rank 0. It is in that communication that the order of the messages gets mixed up.

    Generally, standard output is not treated with priority in an MPI program, so the implementation makes little effort to print the output in the correct order. Typically, the output is kept in the output buffer of the running process and is only printed when one of the following events occurs (there may be more):

      1) The end of the process
      2) A buffer overflow (i.e. a large amount of data is written to the output)
      3) An explicit flush of the output buffer (e.g. sys.stdout.flush())
    

    So you can help yourself by flushing your stdout as you print:

      1) print('my message'); sys.stdout.flush()
      2) print('my message on newer version of python', flush=True)
    
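    For example, the reproduction script from the question could be flushed like this (just a sketch; the flush=True form needs Python 3.3 or newer):

      from mpi4py import MPI
      from time import sleep
      import random

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()

      if rank == 0:
          sleep(2)
          print("head finished sleeping", flush=True)   # push the line out immediately

      comm.Barrier()

      sleep(random.uniform(0, 2))
      print(rank, 'finished sleeping', flush=True)      # each rank flushes its own line

      comm.Barrier()

      if rank == 0:
          print("All done!", flush=True)
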

    However, in practice it is difficult to get this working properly. If flush events occur in multiple MPI processes at the same time, then multiple processes will be sending messages to rank 0 at once, and there is a race condition that essentially dictates the order in which things are printed. So, to get things in the correct order, you need to apply a mix of synchronization and sleep calls, so that flush events happen infrequently enough that the race conditions are avoided.
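
    For instance, here is a sketch of that "synchronize, then print" idea (the ordered_print helper is purely illustrative, and every rank must call it or the Barriers deadlock):

      import sys
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()
      size = comm.Get_size()

      def ordered_print(*args):
          # Let one rank print per round; a Barrier separates the rounds.
          for turn in range(size):
              if rank == turn:
                  print(*args)
                  sys.stdout.flush()   # flush before the next rank takes its turn
              comm.Barrier()

      ordered_print(rank, "finished sleeping")

    Even then the back-end may occasionally interleave lines, which is why some sleep between turns can be needed on top of the Barriers.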

    I suspect what is happening in your case is that the output is only being flushed at the end of each process. Since that happens for all processes at roughly the same time, what you are seeing is the result of this communication race.

    I hope that helps.