Search code examples
fortrangpgpuopenaccpgipgi-accelerator

How can a Fortran-OpenACC routine call another Fortran-OpenACC routine?


I'm currently attempting to accelerate a spectral element fluids solver by porting most of the routines to a GPGPU using OpenACC with the PGI (15.10) compiler. The source code is written in OO-Fortran. This software has "layers" of subroutines that call other functions and subroutines. To bring the code over to a GPU using openacc, I've been first attempting to place "$acc routine" directives in each routine that needs to be ported. During compilation, using "pgf90 -acc -Minfo=accel", I receive the following error :

nvvmCompileProgram error: 9.

Error: /tmp/pgacc2lMnIf9lMqx8.gpu (146, 24): parse invalid forward reference to function 'innerroutine_' with wrong type!

PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (Test.f90: 1)

This same problem can be reproduced with the following simple fortran program :

PROGRAM Test
IMPLICIT NONE

CONTAINS

 SUBROUTINE OuterRoutine( sol, xF, N )
 !$acc routine
   IMPLICIT NONE
   INTEGER :: N
   REAL(KIND=8) :: sol(0:N,1:3)
   REAL(KIND=8) :: xF(0:N,1:3)
   ! LOCAL
   INTEGER :: i

      DO i = 0, N
         xF(i,1:3) = InnerRoutine( sol(i,1:3) )
      ENDDO

 END SUBROUTINE OuterRoutine
 FUNCTION InnerRoutine( sol ) RESULT( xF )
 !$acc routine
   IMPLICIT NONE
   REAL(KIND=8) :: sol(1:3)
   REAL(KIND=8) :: xF(1:3)

      xF(1) = sol(1)*sol(2)
      xF(2) = sol(1)*sol(3)
      xF(3) = sol(1)*sol(1)

 END FUNCTION InnerRoutine

END PROGRAM Test

Again, compiling the above program with "pgf90 -acc -Minfo=accel" yields the problem.

Does openacc support acc-enabled routines calling other acc-enabled routines ?

If so, what am I doing wrong ?


Solution

  • You're using the OpenACC "routine" directive correctly. The problem here is that we (PGI) don't yet support using "routine" with array-valued functions. The problem being that this support requires the compiler to create a temp array to hold the return value. Meaning that every thread would need to allocate this temp array causing a severe performance penalty. Worse is how to handle sharing the temp array if is a gang or worker level routine.

    We do have open requests for this feature, but it may be awhile before we can address it. In the meantime, can you try inlining the routine? i.e. compile with "-Minline".