fortran openacc

OpenACC | Fortran 90: What is the best way to parallelize a nested DO loop?


I am trying to parallelize the following nested DO loop structure (the first code block below) using the OpenACC 'collapse' clause. The variable 'nbl' from the outermost loop appears in the bounds of the inner DO loops, so there is a dependency. Helpfully, the compiler reports this as an error at compile time. So I had to compromise and apply 'collapse' only to the four innermost loops. Is there a way to parallelize this loop nest for maximum performance by also exploiting the parallelism of "nbl = 1,nblocks"?

Compiler: pgfortran
Flags: -acc -fast -ta=tesla:managed -Minfo=accel

Code that gives an error because of the data dependency between the outermost DO loop and the inner DO loops:

!$acc parallel loop collapse(5)
DO nbl = 1,nblocks
DO n_prim = 1,nprims
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)

    Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
    
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$acc end parallel loop

Compromised working code with less parallelism:

DO nbl = 1,nblocks
!$acc parallel loop collapse(4)
DO n_prim = 1,nprims
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)

    Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
    
ENDDO
ENDDO
ENDDO
ENDDO
!$acc end parallel loop
ENDDO

Thanks!


Solution

  • The dependency comes from the array look-ups used for the upper bounds of the inner loops: NK(nbl), NJ(nbl), and NI(nbl) all depend on nbl. To collapse loops, the iteration count of every loop in the nest must be known before entering it, but here the counts vary with nbl.
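
    To see why the bounds matter, here is a rough conceptual sketch (not what the compiler literally generates) of how a collapsed nest is flattened into a single index. The sizes, the array ni, and the program name are made up for illustration; it is reduced to two loops and uses no OpenACC, so it should build with any Fortran 90 compiler.

    ```fortran
    ! Conceptual sketch only: hypothetical sizes, two loops instead of five.
    program collapse_sketch
      implicit none
      integer, parameter :: nblocks = 3, ni_fixed = 4   ! hypothetical sizes
      integer :: ni(nblocks)                            ! hypothetical per-block extents
      integer :: nbl, i, idx, total

      ! Rectangular case: the trip count is a single product known before
      ! entry, so one flat index can be decoded back into (nbl, i).
      total = nblocks * ni_fixed
      do idx = 0, total - 1
         nbl = idx / ni_fixed + 1
         i   = mod(idx, ni_fixed) + 1
         ! ... the loop body would use (i, nbl) here ...
      end do
      print *, 'rectangular trip count:', total

      ! Non-rectangular case: the inner extent depends on nbl, so there is
      ! no single product to flatten over; the total is a sum over blocks.
      ni = (/ 2, 5, 3 /)
      print *, 'per-block trip counts: ', ni
      print *, 'total iterations:      ', sum(ni)
    end program collapse_sketch
    ```

    In the rectangular case the flat index decodes with one divide and one modulo; with per-block extents there is no fixed divisor, which is roughly why the collapse across nbl is rejected.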

    Try something like the following, which splits the parallelism into two levels:

    !$acc parallel loop collapse(2)
    DO nbl = 1,nblocks
    DO n_prim = 1,nprims
    !$acc loop collapse(3)
    DO k = 1, NK(nbl)
    DO j = 1, NJ(nbl)
    DO i = 1, NI(nbl)