I'm currently attempting to accelerate a spectral element fluids solver by porting most of the routines to a GPGPU using OpenACC with the PGI (15.10) compiler. The source code is written in OO-Fortran. This software has "layers" of subroutines that call other functions and subroutines. To bring the code over to a GPU using openacc, I've been first attempting to place "$acc routine" directives in each routine that needs to be ported. During compilation, using "pgf90 -acc -Minfo=accel", I receive the following error :
nvvmCompileProgram error: 9.
Error: /tmp/pgacc2lMnIf9lMqx8.gpu (146, 24): parse invalid forward reference to function 'innerroutine_' with wrong type!
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (Test.f90: 1)
This same problem can be reproduced with the following simple fortran program :
PROGRAM Test
IMPLICIT NONE
CONTAINS
SUBROUTINE OuterRoutine( sol, xF, N )
!$acc routine
IMPLICIT NONE
INTEGER :: N
REAL(KIND=8) :: sol(0:N,1:3)
REAL(KIND=8) :: xF(0:N,1:3)
! LOCAL
INTEGER :: i
DO i = 0, N
xF(i,1:3) = InnerRoutine( sol(i,1:3) )
ENDDO
END SUBROUTINE OuterRoutine
FUNCTION InnerRoutine( sol ) RESULT( xF )
!$acc routine
IMPLICIT NONE
REAL(KIND=8) :: sol(1:3)
REAL(KIND=8) :: xF(1:3)
xF(1) = sol(1)*sol(2)
xF(2) = sol(1)*sol(3)
xF(3) = sol(1)*sol(1)
END FUNCTION InnerRoutine
END PROGRAM Test
Again, compiling the above program with "pgf90 -acc -Minfo=accel" yields the problem.
Does openacc support acc-enabled routines calling other acc-enabled routines ?
If so, what am I doing wrong ?
You're using the OpenACC "routine" directive correctly. The problem here is that we (PGI) don't yet support using "routine" with array-valued functions. The problem being that this support requires the compiler to create a temp array to hold the return value. Meaning that every thread would need to allocate this temp array causing a severe performance penalty. Worse is how to handle sharing the temp array if is a gang or worker level routine.
We do have open requests for this feature, but it may be awhile before we can address it. In the meantime, can you try inlining the routine? i.e. compile with "-Minline".