
Advice on porting nested routines to OpenACC


I am quite new to OpenACC and I'd like to port a rather big Fortran code to GPUs. At present, the code is hybrid MPI and OpenMP, but we'd like to take advantage of GPUs on some big machines. The compiler I'll use for this is nvfortran. The code evolves in time some 3D quantities, stored as 3D arrays with indices (i1,i2,i3). For this, we perform a lot of different operations on the arrays. An important point is that we always operate on only one coordinate at a time; we do not mix coordinates, i.e., we can have

A(i1,i2,i3) = B(i1+1,i2,i3) + C(i1-1,i2,i3)
!OR
A(i1,i2,i3) = B(i1,i2+1,i3) + C(i1,i2-1,i3)

but we don't have A(i1,i2,i3) = B(i1+1,i2,i3) + C(i1,i2-1,i3). This is important because, to take advantage of some optimizations, we first perform all operations along the first coordinate (i1); second, we transpose everything so that the variables are in the order (i2,i1,i3); third, we perform all operations along i2; fourth, we transpose the variables into the order (i3,i1,i2); fifth, we perform the operations along i3; and finally we transpose everything back to (i1,i2,i3).

The logic of the code is the following:

integer :: m1,n1,m2,n2,m3,n3
integer :: m1c,n1c,m2c,n2c,m3c,n3c
integer :: Nmax, nchunks, ichunk
double precision, dimension(:,:,:), allocatable :: A, B, C
double precision, dimension(:,:,:), allocatable :: sliceA, sliceB, sliceC
!
allocate(A(m1:n1,m2:n2,m3:n3),B(m1:n1,m2:n2,m3:n3),C(m1:n1,m2:n2,m3:n3))
... ! A and B are initialized
! Get the number of chunks
call getChunkSize(Nmax,m1,n1,m2,n2,m3,n3,1,m2c,n2c,m3c,n3c,nchunks)
! Start parallel OMP region
!$OMP PARALLEL DEFAULT(SHARED), &
!$OMP          PRIVATE(sliceA,sliceB,sliceC, &
!$OMP                  m2c, n2c, m3c, n3c, &
!$OMP                  ichunk)
... ! Some stuff
!$OMP DO
do ichunk=1,nchunks
  ! Get indexes m2c,n2c,m3c,n3c
  call getChunkSize(Nmax,m1,n1,m2,n2,m3,n3,ichunk,m2c,n2c,m3c,n3c)
  !
  allocate(sliceA(m1:n1,m2c:n2c,m3c:n3c),sliceB(m1:n1,m2c:n2c,m3c:n3c),sliceC(m1:n1,m2c:n2c,m3c:n3c))
  sliceA(:,:,:) = A(m1:n1,m2c:n2c,m3c:n3c)
  sliceB(:,:,:) = B(m1:n1,m2c:n2c,m3c:n3c)
  !
  call operations(sliceA,sliceB,sliceC)
  !
  C(m1:n1,m2c:n2c,m3c:n3c) = sliceC(:,:,:)
  deallocate(sliceA,sliceB,sliceC)
end do
!$OMP END DO
... ! Some other stuff
!$OMP END PARALLEL
!
! REPEAT the same logic, but for slice(m2:n2,m1c:n1c,m3c:n3c) and for slice(m3:n3,m1c:n1c,m2c:n2c)

Here, Nmax is an integer variable, an input of the code, that states the maximum number of elements of the arrays sliceA, sliceB, and sliceC; it is used to improve the cache performance of the code. What we want is to split the big arrays A, B, and C into nchunks pieces (chunks), each with a number of elements smaller than or equal to Nmax. For this, we first use the routine getChunkSize to get nchunks, such that nchunks*(n1-m1+1)*(n2c-m2c+1)*(n3c-m3c+1) = (n1-m1+1)*(n2-m2+1)*(n3-m3+1) and (n1-m1+1)*(n2c-m2c+1)*(n3c-m3c+1) < Nmax. We then have a loop over the chunks, parallelized with OMP, such that each thread takes a slice of the arrays A, B, and C and performs some operations with the routine operations. The results, which the routine operations returns in sliceC, are then copied back to the big array C. A similar task is then performed along the two other coordinates, after transposing the arrays so that the first index is always the one along which we operate.

Two notable examples are with Nmax=(n1-m1+1)*(n2-m2+1)*(n3-m3+1), for which we obviously have nchunks=1 (one single chunk, containing the full arrays), and with Nmax=n1-m1+1, for which we have nchunks=(n2-m2+1)*(n3-m3+1) (one single "line" m1:n1 for each slice).

The routine operations calls some other routines to perform many operations on the arrays sliceA and sliceB to produce sliceC. These other routines contain, among other things, loops of the form

do i3=m3c,n3c
  do i2=m2c,n2c
    do i1=m1,n1
      sliceC(i1,i2,i3) = ...
    end do
  end do
end do

where the operations, as stated above, always have the same indexes i2 and i3 on the left- and right-hand sides of the statement; only i1 can differ (generally we use i1-2, i1-1, i1, i1+1, and i1+2).

Here is my question: what would be the best approach to parallelize with OpenACC directives this code? I see two possibilities:

  1. I keep the same logic, and parallelize the external loop (the one over chunks) over the different gangs, while I parallelize the inner loops in the subroutines (the triple loops over i3,i2,i1) over workers and vectors. In this case, how should I define the operations routine with acc? Moreover, what would be the ideal value for Nmax (e.g., should I try to have the resulting nchunks a multiple of the number of gangs available on the GPU)?
  2. I set Nmax=(n1-m1+1)*(n2-m2+1)*(n3-m3+1) in the input file, so that there is no outer loop (nchunks=1), and I apply all the GPU parallelization to the inner loops only. (The OpenMP version should remain the same: the code must work both on CPUs and on GPUs and I cannot break the OpenMP functionality, although I can introduce some preprocessing flag to avoid copying data from A to sliceA - and similarly for the other variables - when OpenACC is active.)

What would be the benefits/disadvantages of the two strategies? Is one definitely better than the other? Are there other possibilities? One note: in the routine operations I have to allocate/deallocate several other arrays of size (m1:n1,m2c:n2c,m3c:n3c).

Any advice on this subject would be highly appreciated. If you can provide real examples and/or the syntax for the outer loop (in particular, what acc directive I should use for defining the operations routine), even better (I didn't find a good tutorial/book explaining in detail how !$acc routine works)!


Solution

  • Ideally you'd get rid of the chunking and operate directly on the full A, B, and C arrays. You could set the number of chunks to 1, but then you'll suffer the overhead of having to copy the slices to the device each time the routine is called.

    I generally advise a top-down approach for data and a bottom-up approach for compute. That is, move your data regions early in the program, preferably using an "enter data create" just after the array allocation, so the lifetime and scope of the arrays are the same on the host and the device. Then use the "update" directive to synchronize the host and device copies of the arrays when necessary. Data movement is often one of the biggest performance bottlenecks, so minimizing it is critical.
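    For example, a minimal sketch of that pattern on the A, B, and C arrays above (where exactly you place the updates depends on where the host really needs the data):

    allocate(A(m1:n1,m2:n2,m3:n3),B(m1:n1,m2:n2,m3:n3),C(m1:n1,m2:n2,m3:n3))
    !$acc enter data create(A, B, C)   ! device copies live as long as the host ones
    ... ! A and B are initialized on the host
    !$acc update device(A, B)          ! push the initialized data to the GPU
    ... ! device compute regions work on the resident arrays
    !$acc update self(C)               ! pull results back only when the host needs them
    ...
    !$acc exit data delete(A, B, C)
    deallocate(A, B, C)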

    While you don't show the code for "operations", you'll likely want to put the OpenACC parallel compute regions around the loops in this routine. If there are multiple loops, then you'll use multiple compute regions.
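    For instance, on the loop shape shown in the question, one compute region per loop nest would look something like this (the loop body is just illustrative):

    !$acc parallel loop collapse(3) present(sliceA, sliceB, sliceC)
    do i3 = m3c, n3c
      do i2 = m2c, n2c
        do i1 = m1, n1
          sliceC(i1,i2,i3) = sliceA(i1,i2,i3) + sliceB(i1,i2,i3)
        end do
      end do
    end do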

    "acc routine" tells the compiler to create a device callable version of the routine. It needs to be visible to both the caller and the callee so usually placed in the header section of the routine's definition in a module or interface. If there isn't an interface, i.e. F77-style, then you need it twice. Once in the routine's definition so a device version is compiled, and once in the caller's definition so it's declared and the compiler knows that a device version is available. For the declaration, the form "acc routine(routine_name)" would be used.

    If you have parallelizable loops inside the routines themselves, then you may opt to use "routine vector", meaning that the routine will have a vector-level parallel loop inside it. The caller would then have a "gang" parallel loop.
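    Something like this, again just a sketch reusing the made-up helper above (place it in the same module so the caller sees the interface):

      subroutine do_line_vec(n, a, b, c)
        !$acc routine vector         ! this routine contains a vector-parallel loop
        integer, intent(in) :: n
        double precision, intent(in)  :: a(n), b(n)
        double precision, intent(out) :: c(n)
        integer :: i
        !$acc loop vector
        do i = 1, n
          c(i) = a(i) + b(i)
        end do
      end subroutine do_line_vec

    ! Caller: gang parallelism over the outer loops, vector inside the routine
    !$acc parallel loop gang collapse(2) present(sliceA, sliceB, sliceC)
    do i3 = m3c, n3c
      do i2 = m2c, n2c
        call do_line_vec(n1-m1+1, sliceA(:,i2,i3), sliceB(:,i2,i3), sliceC(:,i2,i3))
      end do
    end do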

    As for OpenACC vs OpenMP, it's largely user preference at this point, and my advice above applies to both, and to GPU programming in general.

    OpenMP's data management directives, i.e. "target data", are basically the same as OpenACC's, just with different syntax.
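    The rough OpenMP equivalent of the data pattern sketched above would be:

    !$omp target enter data map(alloc: A, B, C)
    ... ! A and B are initialized on the host
    !$omp target update to(A, B)       ! host -> device, like "acc update device"
    ... ! device compute regions
    !$omp target update from(C)        ! device -> host, like "acc update self"
    ...
    !$omp target exit data map(delete: A, B, C)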

    For compute, OpenMP has two methods: "target teams distribute parallel do" and "target teams loop". "distribute" has more flexibility in the types of constructs that can be used on the device, but incurs more overhead. "loop" is more restrictive but often more performant (at least with nvfortran). "loop" is basically the same as OpenACC's "parallel" construct.
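    On the same loop nest, the two forms look like this (illustrative body again):

    !$omp target teams distribute parallel do collapse(3)
    do i3 = m3c, n3c
      do i2 = m2c, n2c
        do i1 = m1, n1
          sliceC(i1,i2,i3) = sliceA(i1,i2,i3) + sliceB(i1,i2,i3)
        end do
      end do
    end do

    ! ...or, often faster with nvfortran:
    !$omp target teams loop collapse(3)
    do i3 = m3c, n3c
      do i2 = m2c, n2c
        do i1 = m1, n1
          sliceC(i1,i2,i3) = sliceA(i1,i2,i3) + sliceB(i1,i2,i3)
        end do
      end do
    end do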

    There is one key difference in that OpenMP doesn't have a way to expose parallelism inside of device routines. So if you do want to use "routine vector", then you'll want to stick with OpenACC.