I'm trying to use a cuda DevicePtr (called a CUdeviceptr in CUDA-land) returned from foreign code as an accelerate Array with accelerate-llvm-ptx.
The code I've written below somewhat works:
import Data.Array.Accelerate
  (Acc, Array, DIM1, Z(Z), (:.)((:.)), use)
import qualified Data.Array.Accelerate as Acc
import Data.Array.Accelerate.Array.Data
  (GArrayData(AD_Float), unsafeIndexArrayData)
import Data.Array.Accelerate.Array.Sugar
  (Array(Array), fromElt, toElt)
import Data.Array.Accelerate.Array.Unique
  (UniqueArray, newUniqueArray)
import Data.Array.Accelerate.LLVM.PTX (run)
import Foreign.C.Types (CULLong(CULLong))
import Foreign.CUDA.Driver (DevicePtr(DevicePtr))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (intPtrToPtr)
-- A foreign function that uses cuMemAlloc() and cuMemcpyHtoD() to
-- create data on the GPU. The CUdeviceptr (initialized by cuMemAlloc)
-- is returned from this function. It is a CULLong in Haskell.
--
-- The data on the GPU is just a list of the 10 floats
-- [0.0, 1.0, 2.0, ..., 8.0, 9.0]
foreign import ccall "mytest.h mytestcuda"
  cmyTestCuda :: IO CULLong
-- | Convert a 'CULLong' to a 'DevicePtr'.
--
-- A 'CULLong' is the type of a CUDA @CUdeviceptr@. This function
-- converts a raw 'CULLong' into a proper 'DevicePtr' that can be
-- used with the cuda Haskell package.
cullongToDevicePtr :: CULLong -> DevicePtr a
cullongToDevicePtr = DevicePtr . intPtrToPtr . fromIntegral
-- | This function calls 'cmyTestCuda' to get the 'DevicePtr', and
-- wraps that up in an accelerate 'Array'. It then uses this 'Array'
-- in an accelerate computation.
accelerateWithDataFromC :: IO ()
accelerateWithDataFromC = do
  res <- cmyTestCuda
  let DevicePtr ptrToXs = cullongToDevicePtr res
  foreignPtrToXs <- newForeignPtr_ ptrToXs
  uniqueArrayXs <- newUniqueArray foreignPtrToXs :: IO (UniqueArray Float)
  let arrayDataXs = AD_Float uniqueArrayXs :: GArrayData UniqueArray Float
  let shape = Z :. 10 :: DIM1
      xs = Array (fromElt shape) arrayDataXs :: Array DIM1 Float
      ys = Acc.fromList shape [0,2..18] :: Array DIM1 Float
      usedXs = use xs :: Acc (Array DIM1 Float)
      usedYs = use ys :: Acc (Array DIM1 Float)
      computation = Acc.zipWith (+) usedXs usedYs
      zs = run computation
  putStrLn $ "zs: " <> show zs
When compiling and running this program, it correctly prints out the result:
zs: Vector (Z :. 10) [0.0,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0,27.0]
However, from reading through the accelerate and accelerate-llvm-ptx source code, it doesn't seem like this should work.
In most cases, it seems like an accelerate Array carries around a pointer to array data in HOST memory, and a Unique value to uniquely identify the Array. When performing Acc computations, accelerate will load the array data from HOST memory into GPU memory as needed, and keep track of it with a HashMap indexed by the Unique.
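For intuition, here is a rough, simplified paraphrase of those internals (modelled loosely on accelerate-1.x's Data.Array.Accelerate.Array.Unique; the real definitions differ, and the Lifetime placeholder below exists only so the sketch compiles):

import Data.Unique (Unique)
import Foreign.ForeignPtr (ForeignPtr)

-- Stand-in for accelerate's internal Lifetime wrapper.
newtype Lifetime a = Lifetime a

-- A UniqueArray pairs a host-side ForeignPtr with a Unique identifier;
-- the PTX backend keys its device-memory table on that Unique.
data UniqueArray e = UniqueArray
  { uniqueArrayId   :: !Unique
  , uniqueArrayData :: !(Lifetime (ForeignPtr e))
  }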
In the code above, I am creating an Array directly with a pointer to GPU data. This doesn't seem like it should work, but it appears to work anyway.
However, some things don't work. For instance, trying to print out xs (my Array with a pointer to GPU data) fails with a segfault. This makes sense, since the Show instance for Array just tries to peek the data from the HOST pointer. This fails because it is not a HOST pointer, but a GPU pointer:
-- Trying to print xs causes a segfault.
putStrLn $ "xs: " <> show xs
Is there a proper way to take a CUDA DevicePtr and use it directly as an accelerate Array?
Actually, I am surprised that the above worked as well as it did; I couldn't replicate that.
One of the problems here is that device memory is implicitly associated with an execution context; pointers in one context are not valid in a different context, even on the same GPU (unless you explicitly enable peer memory access between those contexts).
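To make that concrete, here is a small sketch using the cuda package's context-management API (push makes a context current on the calling thread, pop restores the previous one; the withContext helper is my own invention):

import Control.Exception (bracket_)
import qualified Foreign.CUDA.Driver as CUDA

-- Make 'ctx' current on the calling thread for the duration of an
-- action, restoring the previously-current context afterwards. Device
-- pointers allocated in 'ctx' are only valid while it is current.
withContext :: CUDA.Context -> IO a -> IO a
withContext ctx = bracket_ (CUDA.push ctx) CUDA.pop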
So, there are actually two components to this problem: getting the foreign device data into an Accelerate Array, and making sure that Accelerate computations execute in the CUDA context that owns that memory.
Here is the C code we'll use to generate data on the GPU:
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

CUdeviceptr generate_gpu_data()
{
    CUresult status = CUDA_SUCCESS;
    CUdeviceptr d_arr;
    const int N = 32;
    float h_arr[N];

    for (int i = 0; i < N; ++i) {
        h_arr[i] = (float)i;
    }

    status = cuMemAlloc(&d_arr, N*sizeof(float));
    if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemAlloc failed (%d)\n", status);
        exit(1);
    }

    status = cuMemcpyHtoD(d_arr, (void*) h_arr, N*sizeof(float));
    if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemcpyHtoD failed (%d)\n", status);
        exit(1);
    }

    return d_arr;
}
And the Haskell/Accelerate code which uses it:
{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Array.Accelerate as A
import Data.Array.Accelerate.Array.Sugar as Sugar
import Data.Array.Accelerate.Array.Data as AD
import Data.Array.Accelerate.Array.Remote.LRU as LRU
import Data.Array.Accelerate.LLVM.PTX as PTX
import Data.Array.Accelerate.LLVM.PTX.Foreign as PTX
import Foreign.CUDA.Driver as CUDA
import Text.Printf

main :: IO ()
main = do
  -- Initialise CUDA and create an execution context. From this we also create
  -- the context that our Accelerate programs will run in.
  --
  CUDA.initialise []
  dev <- CUDA.device 0
  ctx <- CUDA.create dev []
  ptx <- PTX.createTargetFromContext ctx

  -- When created, a context becomes the active context, so when we call the
  -- foreign function this is the context that it will be executed within.
  --
  fp <- c_generate_gpu_data

  -- To import this data into Accelerate, we need both the host-side array
  -- (typically the only thing we see) and then associate this with the existing
  -- device memory (rather than allocating new device memory automatically).
  --
  -- Note that you are still responsible for freeing the device-side data when
  -- you no longer need it.
  --
  arr@(Array _ ad) <- Sugar.allocateArray (Z :. 32) :: IO (Vector Float)
  LRU.insertUnmanaged (ptxMemoryTable ptx) ad fp

  -- NOTE: there seems to be a bug where we haven't recorded that the host-side
  -- data is dirty, and thus needs to be filled in with values from the GPU _if_
  -- those are required on the host. At this point we have the information
  -- necessary to do the transfer ourselves, but I guess this should really be
  -- fixed...
  --
  -- CUDA.peekArray 32 fp (AD.ptrsOfArrayData ad)

  -- An alternative workaround to the above is this no-op computation (this
  -- consumes no additional host or device memory, and executes no kernels).
  -- If you never need the values on the host, you could ignore this step.
  --
  let arr' = PTX.runWith ptx (use arr)

  -- We can now use the array as in a regular Accelerate computation. The only
  -- restriction is that we need to `run*With`, so that we are running in the
  -- context of the foreign memory.
  --
  let r = PTX.runWith ptx $ A.fold (+) 0 (use arr')
  printf "array is: %s\n" (show arr')
  printf "sum is: %s\n" (show r)

  -- Free the foreign memory (again, it is not managed by Accelerate)
  --
  CUDA.free fp

foreign import ccall unsafe "generate_gpu_data"
  c_generate_gpu_data :: IO (DevicePtr Float)
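Since the foreign allocation is not managed by Accelerate, it may also be worth acquiring and releasing it with bracket, so the device memory is freed even if the computation throws. A minimal sketch, reusing the c_generate_gpu_data import and the CUDA qualified import from the program above (withGpuData is my own helper name):

import Control.Exception (bracket)

-- Acquire the foreign device allocation, hand it to an action, and
-- guarantee that CUDA.free runs afterwards, even on exceptions.
withGpuData :: (CUDA.DevicePtr Float -> IO a) -> IO a
withGpuData = bracket c_generate_gpu_data CUDA.free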