I have found an issue in my hybrid MPI/OpenMP code that is reproduced in its simplest form in the code below. I am using 2 threads per MPI rank. These two threads are then used in an OpenMP "sections" construct to do several computations; one of these consists of making an "mpi_allreduce" call on two different vectors A and B, whose results are stored in W and WW. The problem is that every time I run the program I end up with a different output. My suspicion is that the MPI calls are overlapping and the reduced arrays W and WW are being mixed even though they have different names, but I am not sure. Any comment on how to overcome this issue is welcome.
Details: The MPI thread level is initialized to MPI_THREAD_MULTIPLE in the code, but I have also tried serialized and funneled (with the same issue).
I compile the code with mpiifort -openmp allreduce_omp_mpi.f90 and run it with:
export OMP_NUM_THREADS=2
mpirun -np 3 ./a.out
PROGRAM HELLO
  use mpi
  use omp_lib
  IMPLICIT NONE
  INTEGER :: nthreads, tid
  INTEGER :: provided, mpi_err, myid, nproc
  CHARACTER(MPI_MAX_PROCESSOR_NAME) :: hostname
  INTEGER :: nhostchars
  integer :: i
  real*8 :: A(1000), B(1000), W(1000), WW(1000)

  provided = 0
  ! Initialize MPI context
  call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, mpi_err)
  CALL mpi_comm_rank(mpi_comm_world, myid, mpi_err)
  CALL mpi_comm_size(mpi_comm_world, nproc, mpi_err)
  CALL mpi_get_processor_name(hostname, nhostchars, mpi_err)

  ! Initialize arrays
  A = 1.0
  B = 2.0

  ! Check if MPI_THREAD_MULTIPLE is available
  if (provided >= MPI_THREAD_MULTIPLE) then
    write(6,*) ' mpi_thread_multiple provided', myid
  else
    write(6,*) ' not mpi_thread_multiple provided', myid
  endif

  !$OMP PARALLEL PRIVATE(nthreads, tid) NUM_THREADS(2)
  !$omp sections
  !$omp section
  call mpi_allreduce(A, W, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp section
  call mpi_allreduce(B, WW, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp end sections
  !$OMP END PARALLEL

  write(6,*) 'W', (w(i), i=1,10)
  write(6,*) 'WW', (ww(i), i=1,10)

  CALL mpi_finalize(mpi_err)
END
The MPI standard forbids concurrent execution of (blocking) collective operations over the same communicator (Section 5.13 "Correctness [of collective communication]"):
...
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
The key point here is: the same communicator. In your program, each rank's two threads can reach the two MPI_ALLREDUCE calls in either order, so the library is free to match the reduction of A on one rank with the reduction of B on another, mixing the contents of W and WW and producing exactly the nondeterministic output you observe. Nothing prevents you from starting concurrent collective communications over different communicators:
integer, dimension(2) :: comms
integer :: ierr

! Give each section its own communication context
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)

!$omp parallel sections num_threads(2)
!$omp section
call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
!$omp section
call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
!$omp end parallel sections

call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)
This snippet simply duplicates MPI_COMM_WORLD twice. The first copy is used in the first parallel section, the second copy in the second one. Although the two new communicators are copies of MPI_COMM_WORLD, they are separate communication contexts, and thus concurrent operations over them are possible.
MPI_COMM_DUP is an expensive operation, therefore the newly created communicators should be reused for as long as possible before being freed.
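For instance, if the two reductions sit inside an iteration loop, a reasonable pattern is to duplicate the communicators once before the loop and free them only after it. A minimal sketch, reusing the arrays from the question (nsteps is a placeholder for your actual loop bound):

integer, dimension(2) :: comms
integer :: ierr, step, nsteps

nsteps = 100   ! placeholder iteration count

! Pay the duplication cost once, up front
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)

do step = 1, nsteps
  !$omp parallel sections num_threads(2)
  !$omp section
  call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
  !$omp section
  call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
  !$omp end parallel sections
  ! ... use W and WW, update A and B for the next step ...
end do

! Free the duplicates only once they are no longer needed
call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)

Successive reductions on comms(1) (and likewise comms(2)) are issued in the same order on every rank, since the implicit barrier at the end of the parallel sections separates the iterations, so this pattern stays within the standard's rules.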