Tags: fortran, mpi, openacc, nvidia-hpc-compilers

Inter-GPU communication in MPI+OpenACC programming


I am trying to learn how to perform inter-GPU data communication using the following toy code. The task of the program is to send the data of array 'a' from gpu0's memory into gpu1's memory. I took the following route, which involves four steps:

After initializing array 'a' on gpu0,

  • step1: send data from gpu0 to cpu0 (using the !$acc update self() directive)
  • step2: send data from cpu0 to cpu1 (using MPI_SEND())
  • step3: receive data into cpu1 from cpu0 (using MPI_RECV())
  • step4: update gpu1's device memory (using the !$acc update device() directive)
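
Condensed down to just the data movement, the route looks like this (a skeleton of the full program further below; the variables are declared exactly as in the complete listing):

    !$acc update self(a)        ! step1: copy 'a' from gpu0 memory to cpu0 memory

    IF (rank == 0) THEN
        ! step2: host-to-host transfer, cpu0 -> cpu1
        call MPI_SEND(a, N, MPI_DOUBLE_PRECISION, 1, tag, MPI_COMM_WORLD, ierr)
    ELSEIF (rank == 1) THEN
        ! step3: cpu1 receives into its host copy of 'a'
        call MPI_RECV(a, N, MPI_DOUBLE_PRECISION, 0, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
        ! step4: copy the received data from cpu1 memory up to gpu1 memory
        !$acc update device(a)
    ENDIF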

This works perfectly fine, but it looks like a very long route and I think there is a better way of doing this. I tried to read up on the !$acc host_data use_device construct suggested in the following post, but I was not able to implement it:

Getting started with OpenACC + MPI Fortran program

I would like to know how !$acc host_data use_device can be used to perform the task shown below more efficiently.

PROGRAM TOY_MPI_OpenACC
    
    implicit none
    
    include 'mpif.h'
    
    integer :: rank, nprocs, ierr, i, dest_rank, tag, from
    integer :: status(MPI_STATUS_SIZE)
    integer, parameter :: N = 10000
    double precision, dimension(N) :: a
    
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
    
    print*, 'Process ', rank, ' of', nprocs, ' is alive'
    
    !$acc data create(a)
    
        ! initialize 'a' on gpu0 (not cpu0)
        IF (rank == 0) THEN
            !$acc parallel loop default(present)
            DO i = 1,N
                a(i) = 1
            ENDDO
        ENDIF
        
        ! step1: send data from gpu0 to cpu0
        !$acc update self(a)
        
        print*, 'a in rank', rank, ' before communication is ', a(N/2)
        
        
        IF (rank == 0) THEN
            
            ! step2: send from cpu0
            dest_rank = 1;      tag = 1999
            call MPI_SEND(a, N, MPI_DOUBLE_PRECISION, dest_rank, tag, MPI_COMM_WORLD, ierr)
            
        ELSEIF (rank == 1) THEN
            
            ! step3: receive into cpu1
            from = MPI_ANY_SOURCE;      tag = MPI_ANY_TAG;  
            call MPI_RECV(a, N, MPI_DOUBLE_PRECISION, from, tag, MPI_COMM_WORLD, status, ierr)
            
            ! step4: send data into gpu1 from cpu1
            !$acc update device(a)
        ENDIF
        
        call MPI_BARRIER(MPI_COMM_WORLD, ierr)
        
        
        print*, 'a in rank', rank, ' after communication is ', a(N/2)
    
    !$acc end data
    call MPI_BARRIER(MPI_COMM_WORLD, ierr)
END

Compilation: mpif90 -acc -ta=tesla toycode.f90 (mpif90 from nvidia hpc-sdk 21.9)

Execution: mpirun -np 2 ./a.out


Solution

  • Here's an example. Note that I also added some boilerplate code to handle the local node rank-to-device assignment. I also prefer unstructured data regions since they work better for more complex codes, but here they are semantically equivalent to the structured data region you used above. I have guarded the host_data constructs under a CUDA_AWARE_MPI macro since not all MPI installations have CUDA-aware support enabled. Without it, you would need to revert to copying the data between the host and device before/after the MPI calls.
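
    The key piece is the host_data use_device construct: inside that region, 'a' in the MPI call refers to its device address, so a CUDA-aware MPI can move the data directly between the GPUs without the intermediate host copies. Stripped down to the send side, the pattern is simply (the full program below guards this same construct with the CUDA_AWARE_MPI macro):

        !$acc host_data use_device(a)
        ! within this region, 'a' resolves to the device pointer,
        ! so MPI_SEND reads straight out of GPU memory
        call MPI_SEND(a, N, MPI_DOUBLE_PRECISION, dest_rank, tag, MPI_COMM_WORLD, ierr)
        !$acc end host_data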

    % cat mpi_acc.F90
    PROGRAM TOY_MPI_OpenACC
        use mpi
    #ifdef _OPENACC
        use openacc
    #endif
        implicit none
    
        integer :: rank, nprocs, ierr, i, dest_rank, tag, from
        integer :: status(MPI_STATUS_SIZE)
        integer, parameter :: N = 10000
        double precision, dimension(N) :: a
    #ifdef _OPENACC
        integer :: dev, devNum, local_rank, local_comm
        integer(acc_device_kind) :: devtype
    #endif
    
        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
        print*, 'Process ', rank, ' of', nprocs, ' is alive'
    
    #ifdef _OPENACC
    ! set the MPI rank to device mapping
    ! 1) Get the local node's rank number
    ! 2) Get the number of devices on the node
    ! 3) Round-Robin assignment of rank to device
         call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
              MPI_INFO_NULL, local_comm,ierr)
         call MPI_Comm_rank(local_comm, local_rank,ierr)
         devtype = acc_get_device_type()
         devNum = acc_get_num_devices(devtype)
         dev = mod(local_rank,devNum)
         call acc_set_device_num(dev, devtype)
         print*, "Process ",rank," Using device ",dev
    #endif
    
        a = 0
        !$acc enter data copyin(a)
    
            ! initialize 'a' on gpu0 (not cpu0)
            IF (rank == 0) THEN
                !$acc parallel loop default(present)
                DO i = 1,N
                    a(i) = 1
                ENDDO
                !$acc update self(a)
            ENDIF
    
            ! step1: send data from gpu0 to cpu0
            print*, 'a in rank', rank, ' before communication is ', a(N/2)
    
            IF (rank == 0) THEN
    
                ! step2: send from cpu0
                dest_rank = 1;      tag = 1999
    #ifdef CUDA_AWARE_MPI
                !$acc host_data use_device(a)
    #endif
                call MPI_SEND(a, N, MPI_DOUBLE_PRECISION, dest_rank, tag, MPI_COMM_WORLD, ierr)
    #ifdef CUDA_AWARE_MPI
                !$acc end host_data
    #endif
    
            ELSEIF (rank == 1) THEN
    
                ! step3: receive into cpu1
                from = MPI_ANY_SOURCE;      tag = MPI_ANY_TAG;
    #ifdef CUDA_AWARE_MPI
                !$acc host_data use_device(a)
    #endif
                call MPI_RECV(a, N, MPI_DOUBLE_PRECISION, from, tag, MPI_COMM_WORLD, status, ierr)
    #ifdef CUDA_AWARE_MPI
                !$acc end host_data
    #else
                ! step4: send data into gpu1 from cpu1
                !$acc update device(a)
    #endif
            ENDIF
    
            call MPI_BARRIER(MPI_COMM_WORLD, ierr)
    
           !$acc update self(a)
            print*, 'a in rank', rank, ' after communication is ', a(N/2)
    
        !$acc exit data delete(a)
        call MPI_BARRIER(MPI_COMM_WORLD, ierr)
    END
    
    % which mpif90
    /proj/nv/Linux_x86_64/21.9/comm_libs/mpi/bin//mpif90
    % mpif90 -V
    
    nvfortran 21.9-0 64-bit target on x86-64 Linux -tp skylake
    NVIDIA Compilers and Tools
    Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
    % mpif90 -acc -Minfo=accel mpi_acc.F90
    toy_mpi_openacc:
         38, Generating enter data copyin(a(:))
         42, Generating Tesla code
             43, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         42, Generating default present(a(:))
         46, Generating update self(a(:))
         76, Generating update device(a(:))
         82, Generating update self(a(:))
         85, Generating exit data delete(a(:))
    % mpirun -np 2 ./a.out
     Process             1  of            2  is alive
     Process             0  of            2  is alive
     Process             0  Using device             0
     Process             1  Using device             1
     a in rank            1  before communication is     0.000000000000000
     a in rank            0  before communication is     1.000000000000000
     a in rank            0  after communication is     1.000000000000000
     a in rank            1  after communication is     1.000000000000000
    % mpif90 -acc -Minfo=accel mpi_acc.F90 -DCUDA_AWARE_MPI=1
    toy_mpi_openacc:
         38, Generating enter data copyin(a(:))
         42, Generating Tesla code
             43, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         42, Generating default present(a(:))
         46, Generating update self(a(:))
         82, Generating update self(a(:))
         85, Generating exit data delete(a(:))
    % mpirun -np 2 ./a.out
     Process             0  of            2  is alive
     Process             1  of            2  is alive
     Process             1  Using device             1
     Process             0  Using device             0
     a in rank            1  before communication is     0.000000000000000
     a in rank            0  before communication is     1.000000000000000
     a in rank            1  after communication is     1.000000000000000
     a in rank            0  after communication is     1.000000000000000