Writing to a floating point OpenGL texture in CUDA via a surface

I'm writing an OpenGL/CUDA (6.5) interop application. I get a compile-time error when I try to write a floating point value to an OpenGL texture through a surface object in my CUDA kernel.

Here is a high-level description of how I set up the interop. I am successfully reading from my texture in my CUDA kernel, so I believe the setup is correct. I have an OpenGL texture declared with

glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGB32F_ARB, 512, 512, 0, GL_RGB, GL_FLOAT, NULL);

After creating the texture I call cudaGraphicsGLRegisterImage with cudaGraphicsRegisterFlagsSurfaceLoadStore set. Before running my CUDA kernel, I unbind the texture and call cudaGraphicsMapResources on the cudaGraphicsResource pointers obtained from cudaGraphicsGLRegisterImage. Then I get a cudaArray from cudaGraphicsSubResourceGetMappedArray, create an appropriate resource descriptor for the array, and call cudaCreateSurfaceObject to get a pointer to a cudaSurfaceObject_t. I then call cudaMemcpy with cudaMemcpyHostToDevice to copy the cudaSurfaceObject_t to a buffer on the device allocated by cudaMalloc.

In my CUDA kernel I can read from the surface object with something like this, and I have verified that it works as expected.

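__global__ void cudaKernel(cudaSurfaceObject_t tex) {
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    // The x coordinate of a surface access is given in bytes, hence sizeof(float4)*x.
    float4 sample = surf2Dread<float4>(tex, (int)sizeof(float4)*x, y, cudaBoundaryModeClamp);
    // ... modify sample and write it back ...
}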
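
A note on the x argument: surface reads and writes address x in bytes rather than in elements, which is why the kernel multiplies x by sizeof(float4), while y stays a plain row index. A minimal host-side C++ sketch of that arithmetic follows. Texel16 and surfaceByteX are hypothetical names used only for illustration; they are not part of CUDA.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's float4: four packed floats, 16 bytes.
struct Texel16 { float x, y, z, w; };

// Converts an element index along x into the byte offset that a
// surface read/write expects for its x coordinate.
inline int surfaceByteX(int elementX, std::size_t elementSize) {
    return static_cast<int>(elementSize) * elementX;
}
```

For example, surfaceByteX(3, sizeof(Texel16)) corresponds to the fourth texel in a row of 16-byte elements.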

In the kernel I want to modify sample and write it back to the texture. The GPU has compute capability 5.0, so this should be possible. I am trying this:

surf2Dwrite<float4>(sample, tex, (int)sizeof(float4)*x, y, cudaBoundaryModeClamp);

But I get the error:

error: no instance of overloaded function "surf2Dwrite" matches the argument list
argument types are: (float4, cudaSurfaceObject_t, int, int, cudaSurfaceBoundaryMode)

I can see in

cuda-6.5/include/surface_functions.h

that there are only prototypes for integral versions of surf2Dwrite that accept a void * for the second argument. I do see prototypes for surf2Dwrite which accept a float4 with a templated surface object. However, I'm not sure how I could declare a templated surface object with OpenGL interop. I haven't been able to find anything else on how to do this. Any help is appreciated. Thanks.

Solution

It turns out the answer was pretty simple, though I don't know why it works. Instead of calling

surf2Dwrite<float4>(sample, tex, (int)sizeof(float4)*x, y, cudaBoundaryModeClamp);

I needed to call

surf2Dwrite(sample, tex, (int)sizeof(float4)*x, y, cudaBoundaryModeClamp);

To be honest, I'm not sure I fully understand CUDA's use of templating in C++. Does anyone have an explanation?
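
A possible explanation, offered as a sketch rather than a confirmed reading of the CUDA 6.5 headers: in C++, writing explicit template arguments (surf2Dwrite<float4>(...)) restricts overload resolution to function templates, so ordinary non-template overloads are never considered. If the float4 write that takes a cudaSurfaceObject_t is declared as an ordinary overload, and the only templated surf2Dwrite takes a templated surface type, then the explicit form has no viable candidate while the plain call does. The plain C++ below reproduces that situation with mock types. Every name in it (Float4, SurfaceObjectMock, SurfaceRefMock, surfWriteMock, ExplicitCallCompiles, callWithDeduction) is hypothetical, and the overload layout is an assumption about the headers.

```cpp
#include <type_traits>
#include <utility>

// Mock stand-ins; these are NOT the real CUDA types.
struct Float4 { float x, y, z, w; };
typedef unsigned long long SurfaceObjectMock;  // plays the role of cudaSurfaceObject_t
struct SurfaceRefMock {};                      // plays the role of a templated surface type

// Function template: only accepts the surface-reference stand-in.
template <class T>
int surfWriteMock(T, SurfaceRefMock, int, int) { return 1; }

// Ordinary (non-template) overload: accepts the surface-object stand-in.
// Assumption: the real header declares the float4 surface-object write this way.
inline int surfWriteMock(Float4, SurfaceObjectMock, int, int) { return 2; }

// Detects at compile time whether surfWriteMock<Float4>(Float4, Surf, int, int)
// is a valid call. With explicit template arguments, only the template is a candidate.
template <class Surf, class = void>
struct ExplicitCallCompiles : std::false_type {};

template <class Surf>
struct ExplicitCallCompiles<
    Surf,
    decltype(void(surfWriteMock<Float4>(std::declval<Float4>(),
                                        std::declval<Surf>(), 0, 0)))>
    : std::true_type {};

// Call without explicit template arguments: the non-template overload takes part
// in overload resolution and is the only viable candidate for a surface object.
inline int callWithDeduction(SurfaceObjectMock surf) {
    Float4 sample = {0.f, 0.f, 0.f, 0.f};
    return surfWriteMock(sample, surf, 0, 0);
}
```

In this mock, ExplicitCallCompiles<SurfaceObjectMock>::value is false, which mirrors the "no instance of overloaded function" error, while callWithDeduction picks the non-template overload. Whether the real headers are laid out this way is worth checking against the surface-object declarations shipped with your toolkit.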