
Multi-threaded multi GPU computation using openMP and openACC


I'm trying to write code that maps each OpenMP thread to a single GPU. I've found very few case studies or example codes on this. Since I'm not from a computer science background, my programming skills are limited.

This is what the basic idea looks like.

And this is the code developed so far:

    CALL OMP_SET_NUM_THREADS(2)
    !$omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
    do while ( num.gt.iteration )

        id = omp_get_thread_num()
        call acc_set_device_num(id+1, acc_device_nvidia)

        !!$acc kernels
        !error=0.0_rk
        !!$omp do
        !$acc kernels
        !!$omp do
        do j=2,nj-1
            !!$acc kernels
            do i=2,ni-1
                T(i,j)=0.25*(T_o(i+1,j)+T_o(i-1,j)+T_o(i,j+1)+T_o(i,j-1))
            enddo
            !!$acc end kernels
        enddo
        !!$omp end do
        !$acc end kernels
        !!$acc update host(T,T_o)

        error=0.0_rk
        do j=2,nj-1
            do i=2,ni-1
                error = max( abs(T(i,j) - T_o(i,j)), error )
                T_o(i,j) = T(i,j)
            enddo
        enddo
        !!$acc end kernels
        !!$acc update host(T,T_o,error)

        iteration = iteration+1
        print*, iteration, error
        !print*, id

    enddo
    !$omp end parallel

Solution

  • There are a number of issues here.

    First, you can't put an OpenMP (or OpenACC) parallel loop on a DO WHILE. A DO WHILE has an indeterminate number of iterations and therefore creates a dependency: exiting the loop depends on the previous iteration of the loop. You need to use a DO loop, where the number of iterations is known upon entry into the loop.

    Second, even if you converted this to a DO loop, you'd get a race condition when run in parallel. Each OpenMP thread would be assigning values to the same elements of the T and T_o arrays. Plus, the result of T_o is used as input to the next iteration, creating a dependency. In other words, you'd get wrong answers if you tried to parallelize the outer iteration loop.

    For the OpenACC code, I'd suggest adding a data region around the iteration loop, i.e. "!$acc data copy(T,T_o)" before the iteration loop and "!$acc end data" after it, so that the data is created on the device only once. As you have it now, the data is implicitly created and copied each time through the iteration loop, causing unnecessary data movement. Also add a kernels region around the max-error reduction loop so it is offloaded as well.

    In general, I prefer using MPI+OpenACC for multi-GPU programming rather than OpenMP. With MPI, the domain decomposition is inherent, and you then have a one-to-one mapping of MPI rank to device. Not that OpenMP can't work, but you often need to decompose the domain manually, and trying to manage multiple device memories and keep them in sync can be tricky. Plus, with MPI your code can also go across nodes rather than being limited to a single node.
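    As a sketch of the DO-loop conversion suggested above (max_iter and tol are assumed names, not from the original code):

        ! A counted DO loop replaces the DO WHILE: the trip count is known
        ! on entry, and EXIT handles early convergence.
        do iteration = 1, max_iter      ! max_iter is an assumed bound
            ! ... update T from T_o and compute error ...
            if (error < tol) exit       ! tol is an assumed tolerance
        enddo

    Note that the outer iteration loop still can't be parallelized, for the race-condition reasons above; the DO form just gives a defined trip count.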
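    The data-region suggestion might look like the following single-GPU sketch (max_iter is an assumed bound, and the OpenMP multi-device logic is left out for clarity):

        !$acc data copy(T, T_o)          ! create T and T_o on the device once
        do iteration = 1, max_iter
            !$acc kernels
            do j = 2, nj-1
                do i = 2, ni-1
                    T(i,j) = 0.25*(T_o(i+1,j)+T_o(i-1,j)+T_o(i,j+1)+T_o(i,j-1))
                enddo
            enddo
            !$acc end kernels

            error = 0.0_rk
            !$acc kernels                ! offload the max-error reduction too
            do j = 2, nj-1
                do i = 2, ni-1
                    error = max(abs(T(i,j) - T_o(i,j)), error)
                    T_o(i,j) = T(i,j)
                enddo
            enddo
            !$acc end kernels
            print *, iteration, error
        enddo
        !$acc end data

    No "update host" is needed inside the loop: the compiler typically copies the scalar reduction result (error) back after the kernels region, and T and T_o only come back to the host at "end data".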
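    For reference, a minimal MPI+OpenACC rank-to-device mapping might look like this (a sketch, not from the original code; it assumes one MPI rank per GPU and uses a shared-memory communicator to get a node-local rank):

        program mpi_acc_sketch
            use mpi
            use openacc
            implicit none
            integer :: ierr, local_comm, local_rank, ndev

            call MPI_Init(ierr)
            ! Ranks sharing a node get consecutive node-local ranks.
            call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                                     MPI_INFO_NULL, local_comm, ierr)
            call MPI_Comm_rank(local_comm, local_rank, ierr)
            ! One GPU per rank; OpenACC device numbering starts at 1.
            ndev = acc_get_num_devices(acc_device_nvidia)
            call acc_set_device_num(mod(local_rank, ndev) + 1, acc_device_nvidia)
            ! ... each rank now solves its own subdomain on its own GPU ...
            call MPI_Finalize(ierr)
        end program mpi_acc_sketch

    With this mapping, the domain decomposition follows the ranks, and halo exchange between subdomains is done with ordinary MPI sends and receives.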