I am trying to parallelize the following nested DO loop structure (the first code below) using 'Collapse' directive in OpenACC. The variable 'nbl' present in the outermost loop is present in the other DO loops, so there is dependency. Thanks to the compiler its showing an error in advance. So I had to compromise and construct 'collapse' directive only to the remaining four inner most loops. Is there a way to parallelize this loop to get maximum performance by utilizing the parallelism of "nbl = 1,nblocks" as well?
Compiler: pgfortran Flags: -acc -fast -ta=tesla:managed -Minfo=accel
Code that's giving error due to data dependency between outer most DO loop and other inner DO loops:
!$acc parallel loop collapse(5)
DO nbl = 1,nblocks
DO n_prim = 1,nprims
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)
Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$acc end parallel loop
Compromised working code with lesser parllelism:
DO nbl = 1,nblocks
!$acc parallel loop collapse(4)
DO n_prim = 1,nprims
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)
Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
ENDDO
ENDDO
ENDDO
ENDDO
!$acc end parallel loop
ENDDO
Thanks!
The dependency is with the array look-ups for the upper bounds of the loops. In order to collapse loops, the iteration count of the loop must be known before entering, but here the count is variable.
Try something like the following and split the parallelism into two levels:
!$acc parallel loop collapse(2)
DO nbl = 1,nblocks
DO n_prim = 1,nprims
!$acc loop collapse(3)
DO k = 1, NK(nbl)
DO j = 1, NJ(nbl)
DO i = 1, NI(nbl)