I'm trying to use a cuda DevicePtr (called a CUdeviceptr in CUDA-land) returned from foreign code as an accelerate Array with accelerate-llvm-ptx.
The code I've written below somewhat works:
import Data.Array.Accelerate
  (Acc, Array, DIM1, Z(Z), (:.)((:.)), use)
import qualified Data.Array.Accelerate as Acc
import Data.Array.Accelerate.Array.Data
  (GArrayData(AD_Float), unsafeIndexArrayData)
import Data.Array.Accelerate.Array.Sugar
  (Array(Array), fromElt, toElt)
import Data.Array.Accelerate.Array.Unique
  (UniqueArray, newUniqueArray)
import Data.Array.Accelerate.LLVM.PTX (run)
import Foreign.C.Types (CULLong(CULLong))
import Foreign.CUDA.Driver (DevicePtr(DevicePtr))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (intPtrToPtr)
-- A foreign function that uses cuMemAlloc() and cuMemcpyHtoD() to
-- create data on the GPU. The CUdeviceptr (initialized by cuMemAlloc)
-- is returned from this function. It is a CULLong in Haskell.
--
-- The data on the GPU is just a list of the 10 floats
-- [0.0, 1.0, 2.0, ..., 8.0, 9.0]
foreign import ccall "mytest.h mytestcuda"
  cmyTestCuda :: IO CULLong
-- | Convert a 'CULLong' to a 'DevicePtr'.
--
-- A 'CULLong' is the type of a CUDA @CUdeviceptr@. This function
-- converts a raw 'CULLong' into a proper 'DevicePtr' that can be
-- used with the cuda Haskell package.
cullongToDevicePtr :: CULLong -> DevicePtr a
cullongToDevicePtr = DevicePtr . intPtrToPtr . fromIntegral
-- | This function calls 'cmyTestCuda' to get the 'DevicePtr', and
-- wraps that up in an accelerate 'Array'. It then uses this 'Array'
-- in an accelerate computation.
accelerateWithDataFromC :: IO ()
accelerateWithDataFromC = do
  res <- cmyTestCuda
  let DevicePtr ptrToXs = cullongToDevicePtr res
  foreignPtrToXs <- newForeignPtr_ ptrToXs
  uniqueArrayXs <- newUniqueArray foreignPtrToXs :: IO (UniqueArray Float)
  let arrayDataXs = AD_Float uniqueArrayXs :: GArrayData UniqueArray Float
  let shape = Z :. 10 :: DIM1
      xs = Array (fromElt shape) arrayDataXs :: Array DIM1 Float
      ys = Acc.fromList shape [0,2..18] :: Array DIM1 Float
      usedXs = use xs :: Acc (Array DIM1 Float)
      usedYs = use ys :: Acc (Array DIM1 Float)
      computation = Acc.zipWith (+) usedXs usedYs
      zs = run computation
  putStrLn $ "zs: " <> show zs
When compiling and running this program, it correctly prints out the result:
zs: Vector (Z :. 10) [0.0,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0,27.0]
However, from reading through the accelerate and accelerate-llvm-ptx source code, it doesn't seem like this should work.
In most cases, it seems like an accelerate Array carries around a pointer to array data in HOST memory, and a Unique value to uniquely identify the Array. When performing Acc computations, accelerate will load the array data from HOST memory into GPU memory as needed, and keep track of it with a HashMap indexed by the Unique.
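For intuition, here is a rough, simplified paraphrase of those internals (modelled loosely on accelerate-1.x's Data.Array.Accelerate.Array.Unique; the real definitions differ, and the Lifetime placeholder below exists only so the sketch compiles):

import Data.Unique (Unique)
import Foreign.ForeignPtr (ForeignPtr)

-- Stand-in for accelerate's internal Lifetime wrapper.
newtype Lifetime a = Lifetime a

-- A UniqueArray pairs a host-side ForeignPtr with a Unique identifier;
-- the PTX backend keys its device-memory table on that Unique.
data UniqueArray e = UniqueArray
  { uniqueArrayId   :: !Unique
  , uniqueArrayData :: !(Lifetime (ForeignPtr e))
  }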
In the code above, I am creating an Array directly with a pointer to GPU data. This doesn't seem like it should work, but it appears to work anyway.
However, some things don't work. For instance, trying to print out xs (my Array with a pointer to GPU data) fails with a segfault. This makes sense, since the Show instance for Array just tries to peek the data from the HOST pointer. This fails because it is not a HOST pointer, but a GPU pointer:
-- Trying to print xs causes a segfault.
putStrLn $ "xs: " <> show xs
Is there a proper way to take a CUDA DevicePtr and use it directly as an accelerate Array?
Actually, I am surprised that the above worked as well as it did; I couldn't replicate that.
One of the problems here is that device memory is implicitly associated with an execution context; pointers in one context are not valid in a different context, even on the same GPU (unless you explicitly enable peer memory access between those contexts).
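To make that concrete, here is a small sketch using the cuda package's context-management API (push makes a context current on the calling thread, pop restores the previous one; the withContext helper is my own invention):

import Control.Exception (bracket_)
import qualified Foreign.CUDA.Driver as CUDA

-- Make 'ctx' current on the calling thread for the duration of an
-- action, restoring the previously-current context afterwards. Device
-- pointers allocated in 'ctx' are only valid while it is current.
withContext :: CUDA.Context -> IO a -> IO a
withContext ctx = bracket_ (CUDA.push ctx) CUDA.pop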
So, there are actually two components to this problem: getting the foreign device data into an Accelerate Array, and making sure that Accelerate computations execute in the CUDA context that owns that memory.
Here is the C code we'll use to generate data on the GPU:
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

CUdeviceptr generate_gpu_data()
{
    CUresult status = CUDA_SUCCESS;
    CUdeviceptr d_arr;
    const int N = 32;
    float h_arr[N];

    for (int i = 0; i < N; ++i) {
        h_arr[i] = (float)i;
    }

    status = cuMemAlloc(&d_arr, N*sizeof(float));
    if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemAlloc failed (%d)\n", status);
        exit(1);
    }

    status = cuMemcpyHtoD(d_arr, (void*) h_arr, N*sizeof(float));
    if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemcpyHtoD failed (%d)\n", status);
        exit(1);
    }

    return d_arr;
}
And the Haskell/Accelerate code which uses it:
{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Array.Accelerate as A
import Data.Array.Accelerate.Array.Sugar as Sugar
import Data.Array.Accelerate.Array.Data as AD
import Data.Array.Accelerate.Array.Remote.LRU as LRU
import Data.Array.Accelerate.LLVM.PTX as PTX
import Data.Array.Accelerate.LLVM.PTX.Foreign as PTX
import Foreign.CUDA.Driver as CUDA
import Text.Printf

main :: IO ()
main = do
  -- Initialise CUDA and create an execution context. From this we also create
  -- the context that our Accelerate programs will run in.
  --
  CUDA.initialise []
  dev <- CUDA.device 0
  ctx <- CUDA.create dev []
  ptx <- PTX.createTargetFromContext ctx

  -- When created, a context becomes the active context, so when we call the
  -- foreign function this is the context that it will be executed within.
  --
  fp <- c_generate_gpu_data

  -- To import this data into Accelerate, we need both the host-side array
  -- (typically the only thing we see) and then associate this with the existing
  -- device memory (rather than allocating new device memory automatically).
  --
  -- Note that you are still responsible for freeing the device-side data when
  -- you no longer need it.
  --
  arr@(Array _ ad) <- Sugar.allocateArray (Z :. 32) :: IO (Vector Float)
  LRU.insertUnmanaged (ptxMemoryTable ptx) ad fp

  -- NOTE: there seems to be a bug where we haven't recorded that the host-side
  -- data is dirty, and thus needs to be filled in with values from the GPU _if_
  -- those are required on the host. At this point we have the information
  -- necessary to do the transfer ourselves, but I guess this should really be
  -- fixed...
  --
  -- CUDA.peekArray 32 fp (AD.ptrsOfArrayData ad)

  -- An alternative workaround to the above is this no-op computation (this
  -- consumes no additional host or device memory, and executes no kernels).
  -- If you never need the values on the host, you could ignore this step.
  --
  let arr' = PTX.runWith ptx (use arr)

  -- We can now use the array as in a regular Accelerate computation. The only
  -- restriction is that we need to `run*With`, so that we are running in the
  -- context of the foreign memory.
  --
  let r = PTX.runWith ptx $ A.fold (+) 0 (use arr')
  printf "array is: %s\n" (show arr')
  printf "sum is: %s\n" (show r)

  -- Free the foreign memory (again, it is not managed by Accelerate)
  --
  CUDA.free fp

foreign import ccall unsafe "generate_gpu_data"
  c_generate_gpu_data :: IO (DevicePtr Float)
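Since the foreign allocation is not managed by Accelerate, it may also be worth acquiring and releasing it with bracket, so the device memory is freed even if the computation throws. A minimal sketch, reusing the c_generate_gpu_data import and the CUDA qualified import from the program above (withGpuData is my own helper name):

import Control.Exception (bracket)

-- Acquire the foreign device allocation, hand it to an action, and
-- guarantee that CUDA.free runs afterwards, even on exceptions.
withGpuData :: (CUDA.DevicePtr Float -> IO a) -> IO a
withGpuData = bracket c_generate_gpu_data CUDA.free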