I have found an issue in my hybrid MPI/OpenMP code that is reproduced in its simplest form in the code below. I am using 2 threads per MPI rank. These two threads are then used in an OpenMP "sections" construct to do several computations; one of these consists of making an "mpi_allreduce" call on two different vectors A and B, whose results are stored in W and WW. The problem is that every time I run the program I end up with a different output. My suspicion is that the MPI calls are overlapping and the reduced arrays W and WW are being mixed even though they have different names, but I am not sure. Any comment on how to overcome this issue is welcome.
Details: The MPI thread level is initialized to MPI_THREAD_MULTIPLE in the code, but I have also tried serialized and funneled (with the same issue).
I compile the code with mpiifort -openmp allreduce_omp_mpi.f90 and run it with:
export OMP_NUM_THREADS=2
mpirun -np 3 ./a.out
PROGRAM HELLO
  use mpi
  use omp_lib
  IMPLICIT NONE
  INTEGER :: nthreads, tid
  INTEGER :: provided, mpi_err, myid, nproc
  CHARACTER(MPI_MAX_PROCESSOR_NAME) :: hostname
  INTEGER :: nhostchars
  integer :: i
  real*8 :: A(1000), B(1000), W(1000), WW(1000)

  provided = 0
  ! Initialize MPI context
  call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, mpi_err)
  CALL mpi_comm_rank(mpi_comm_world, myid, mpi_err)
  CALL mpi_comm_size(mpi_comm_world, nproc, mpi_err)
  CALL mpi_get_processor_name(hostname, nhostchars, mpi_err)

  ! Initialize arrays
  A = 1.0
  B = 2.0

  ! Check if MPI_THREAD_MULTIPLE is available
  if (provided >= MPI_THREAD_MULTIPLE) then
    write(6,*) ' mpi_thread_multiple provided', myid
  else
    write(6,*) ' not mpi_thread_multiple provided', myid
  endif

  !$OMP PARALLEL PRIVATE(nthreads, tid) NUM_THREADS(2)
  !$omp sections
  !$omp section
  call mpi_allreduce(A, W, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp section
  call mpi_allreduce(B, WW, 1000, mpi_double_precision, mpi_sum, mpi_comm_world, mpi_err)
  !$omp end sections
  !$OMP END PARALLEL

  write(6,*) 'W', (w(i), i=1,10)
  write(6,*) 'WW', (ww(i), i=1,10)

  CALL mpi_finalize(mpi_err)
END
The MPI standard forbids concurrent execution of (blocking) collective operations over the same communicator (Section 5.13 "Correctness [of collective communication]"):
...
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
The key point here is: the same communicator. In your program, each rank's two threads can reach the two MPI_ALLREDUCE calls in either order, so the library is free to match the reduction of A on one rank with the reduction of B on another, mixing the contents of W and WW and producing exactly the nondeterministic output you observe. Nothing prevents you from starting concurrent collective communications over different communicators:
integer, dimension(2) :: comms
integer :: ierr

! Give each section its own communication context
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)

!$omp parallel sections num_threads(2)
!$omp section
call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
!$omp section
call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
!$omp end parallel sections

call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)
This snippet simply duplicates MPI_COMM_WORLD twice. The first copy is used in the first parallel section, the second copy in the second one. Although the two new communicators are copies of MPI_COMM_WORLD, they are separate communication contexts, and thus concurrent operations over them are possible.
MPI_COMM_DUP is an expensive operation, therefore the newly created communicators should be reused for as long as possible before being freed.
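For instance, if the two reductions sit inside an iteration loop, a reasonable pattern is to duplicate the communicators once before the loop and free them only after it. A minimal sketch, reusing the arrays from the question (nsteps is a placeholder for your actual loop bound):

integer, dimension(2) :: comms
integer :: ierr, step, nsteps

nsteps = 100   ! placeholder iteration count

! Pay the duplication cost once, up front
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(1), ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, comms(2), ierr)

do step = 1, nsteps
  !$omp parallel sections num_threads(2)
  !$omp section
  call MPI_ALLREDUCE(A, W, 1000, MPI_REAL8, MPI_SUM, comms(1), ierr)
  !$omp section
  call MPI_ALLREDUCE(B, WW, 1000, MPI_REAL8, MPI_SUM, comms(2), ierr)
  !$omp end parallel sections
  ! ... use W and WW, update A and B for the next step ...
end do

! Free the duplicates only once they are no longer needed
call MPI_COMM_FREE(comms(1), ierr)
call MPI_COMM_FREE(comms(2), ierr)

Successive reductions on comms(1) (and likewise comms(2)) are issued in the same order on every rank, since the implicit barrier at the end of the parallel sections separates the iterations, so this pattern stays within the standard's rules.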