Returning a pointer to a device allocated matrix from C to Fortran

First off, I'm kind of new with Fortran/C/CUDA. Secondly, I'm working on a Fortran/C program that performs matrix vector multiplication on the GPU using cuBLAS. I need to multiply multiple (up to 1000) vectors with the one matrix before I need to update the matrix contents. However, the current version I have has to reallocate the matrix every time a new vector is sent to the GPU (which is incredibly wasteful and slow since the matrix hasn't changed).

I want to be able to multiply the matrix with the vector without having to reallocate the matrix for every vector. An idea I had involved calling a separate C function that would allocate the matrix to the GPU, returns a pointer to the Fortran main program, and then calls another C function that performs the matrix vector multiplication.

Using ISO_C_BINDING, I returned a pointer to a floating point number into the variable:

type(C_PTR) :: ptr

and when I try to pass this into the matrix vector C function:

in Fortran

call cudaFunction(ptr,vector, N)

in C

extern "C" void cudaFunction_(float *mat, float *vector, int *N)

everything compiles and runs, but the execution of cublasSgemv fails to execute. Any ideas on why this would be happening? I've seen a few post kind of related but they never try to send the returned pointer back to C and this is where (I believe) I am having the issue.

Thanks in advance!

Solution

I would suggest that you not reinvent the wheel, but use the cublas fortran bindings that are provided for this purpose.

The "thunking" wrapper is not what you want. It does implicit copy operations as needed, any time you use a cublas call in fortran.

You want the "non-thunking" wrapper, so you have explicit control over the copying going on. You can use fortran equivalents of Get/SetMatrix and Get/SetVector to copy data back and forth.

There is a sample code (example B.2) showing how to use the non-thunking wrapper included in the cublas documentation.

Even if you do want to re-invent the wheel, the wrapper will show you how to make the necessary syntax work to move between C and Fortran.

On a standard linux CUDA install, the wrappers are in /usr/local/cuda/src The non-thunking wrapper is /usr/local/cuda/src/fortran.c

Here's a fully worked example:

cublasf.f:

      program cublas_fortran_example
      implicit none
      integer i, j

c     helper functions
      integer cublas_init
      integer cublas_shutdown
      integer cublas_alloc
      integer cublas_free
      integer cublas_set_vector
      integer cublas_get_vector
c     selected blas functions
      double precision cublas_ddot
      external cublas_daxpy
      external cublas_dscal
      external cublas_dcopy
      double precision cublas_dnrm2
c     cublas variables
      integer cublas_status
      real*8 x(30), y(30)
      double precision alpha, beta
      double precision nrm
      integer*8 d_x, d_y, d_alpha, d_beta, d_nrm
      integer*8 dsize1, dlength1, dlength2
      double precision dresult



      write(*,*) "testing cublas fortran example"

c     initialize cublas library
c     CUBLAS_STATUS_SUCCESS=0
      cublas_status = cublas_init()
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS Library initialization failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
c     initialize data
      do j=1,30
        x(j) = 1.0
        y(j) = 2.0
      enddo
      dsize1 = 8
      dlength1 = 30
      dlength2 = 1
      alpha = 2.0
      beta = 3.0
c     allocate device storage
      cublas_status = cublas_alloc(dlength1, dsize1, d_x)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength1, dsize1, d_y)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_alpha)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_beta)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_nrm)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS device malloc failed"
         stop
      endif

c     copy data from host to device

      cublas_status = cublas_set_vector(dlength1, dsize1, x, dlength2,
     >     d_x, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to device failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      cublas_status = cublas_set_vector(dlength1, dsize1, y, dlength2,
     >     d_y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to device failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif

      dresult = cublas_ddot(dlength1, d_x, dlength2, d_y, dlength2)
      write(*,*) "dot product result=",dresult

      dresult = cublas_dnrm2(dlength1, d_x, dlength2)
      write(*,*) "nrm2 of x result=",dresult

      dresult = cublas_dnrm2(dlength1, d_y, dlength2)
      write(*,*) "nrm2 of y result=",dresult

      call cublas_daxpy(dlength1, alpha, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >     y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "daxpy y(1)  =", y(1)
      write(*,*) "daxpy y(30) =", y(30)


      call cublas_dscal(dlength1, beta, d_x, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_x, dlength2,
     >     x, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "dscal x(1)  =", x(1)
      write(*,*) "dscal x(30) =", x(30)


      call cublas_dcopy(dlength1, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >     y, dlength2)
      if (cublas_status /= 0) then
         write(*,*) "CUBLAS copy to host failed"
         write(*,*) "cublas_status=",cublas_status
         stop
      endif
      write(*,*) "dcopy y(1)  =", y(1)
      write(*,*) "dcopy y(30) =", y(30)

c     deallocate GPU memory and exit
      cublas_status = cublas_free(d_x)
      cublas_status = cublas_free(d_y)
      cublas_status = cublas_free(d_alpha)
      cublas_status = cublas_free(d_beta)
      cublas_status = cublas_free(d_nrm)
      cublas_status = cublas_shutdown()
      stop
      end

compile/run:

$ gfortran -c -o cublasf.o cublasf.f
$ gcc -c -DCUBLAS_GFORTRAN -I/usr/local/cuda/include -I/usr/local/cuda/src -o fortran.o /usr/local/cuda/src/fortran.c
$ gfortran -L/usr/local/cuda/lib64 -lcublas -o cublasf cublasf.o fortran.o
$ ./cublasf
 testing cublas fortran example
 dot product result=   60.0000000000000
 nrm2 of x result=   5.47722557505166
 nrm2 of y result=   10.9544511501033
 daxpy y(1)  =   4.00000000000000
 daxpy y(30) =   4.00000000000000
 dscal x(1)  =   3.00000000000000
 dscal x(30) =   3.00000000000000
 dcopy y(1)  =   3.00000000000000
 dcopy y(30) =   3.00000000000000
$

CUDA 5.0, RHEL 5.5