Tags: c++, memory, cuda, nvidia, tegra

Combining texture memory with Unified Memory in CUDA 6


I am writing a CUDA application for the Jetson TK1 using CUDA 6. I got the impression from Mark Harris's blog post

Jetson TK1: Mobile Embedded Supercomputer Takes CUDA Everywhere

that the memory of the Tegra K1 is physically unified. I have also observed results indicating that cudaMallocManaged is significantly faster for global memory than an ordinary cudaMemcpy, probably because Unified Memory doesn't require any copying.
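
Roughly, the kind of comparison I have in mind is the following (the kernel and sizes here are only placeholders to illustrate the two allocation paths, not my real application):

#include <iostream>

__global__ void scale(float* data, int n){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) data[i] *= 2.0f;
}

int main(){
    const int n = 1 << 20;
    const size_t bytes = n*sizeof(float);

    // Explicit path: separate host and device buffers plus a copy in each direction.
    float* h = new float[n];
    float* d;
    cudaMalloc(&d, bytes);
    for(int i = 0; i < n; ++i) h[i] = float(i);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255)/256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();   // make sure the GPU is idle before touching managed memory

    // Managed path: one pointer usable from host and device; on the physically
    // unified memory of the Tegra K1 no bulk copy should be needed.
    float* m;
    cudaMallocManaged(&m, bytes);
    for(int i = 0; i < n; ++i) m[i] = float(i);
    scale<<<(n + 255)/256, 256>>>(m, n);
    cudaDeviceSynchronize();   // required before the host reads m again

    std::cout << h[1] << " " << m[1] << "\n";

    delete[] h;
    cudaFree(d);
    cudaFree(m);
}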

However, what do I do when I want to use texture memory for parts of my application? I have not found any support for textures using cudaMallocManaged, so I have assumed that I have to use the normal cudaMemcpyToArray and cudaBindTextureToArray?

Using the previously mentioned method often seems to work, but the variables managed by cudaMallocManaged sometimes give me weird segmentation faults. Is this the right way to use texture memory along with Unified Memory? The following code illustrates how I do it. This code works fine, but my question is whether this is the right way to go or whether it might create undefined behaviour that could cause e.g. segmentation faults.

#include <iostream>

#define width 16
#define height 16
texture<float, cudaTextureType2D, cudaReadModeElementType> input_tex;

__global__ void some_tex_kernel(float* output){
    int i= threadIdx.x;
    float x = i%width+0.5f;
    float y =  i/width+0.5f;
    output[i] = tex2D(input_tex, x, y);
}

int main(){
    float* out;
    if(cudaMallocManaged(&out, width*height*sizeof(float))!= cudaSuccess)
        std::cout << "unified not working\n";

    for(int i=0; i< width*height; ++i){
        out[i] = float(i);
    }

    const cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray* input_t;
    cudaMallocArray(&input_t, &desc, width, height);
    cudaMemcpyToArrayAsync(input_t, 0, 0, out, width*height*sizeof(float),  cudaMemcpyHostToDevice);

    input_tex.filterMode = cudaFilterModeLinear;
    cudaBindTextureToArray(input_tex, input_t, desc);

    some_tex_kernel<<<1, width*height>>>(out);
    cudaDeviceSynchronize();

    for(int i=0;i<width*height; ++i)
        std::cout << out[i] << " ";

    cudaFree(out);
    cudaFreeArray(input_t); 
    return 0;
}

Another thing that I find odd is that if I remove the cudaDeviceSynchronize() from the code I always get segmentation faults. I understand that the result might not be finished if I read it without synchronizing, but shouldn't the variable still be accessible?

Anyone have a clue?

Mattias


Solution

  • The only managed memory possibilities at this time are static allocations using __device__ __managed__ or dynamic allocations using cudaMallocManaged(). There is no direct support for textures, surfaces, constant memory, etc.; a short sketch of both allocation forms appears at the end of this answer.

    Your usage of texturing is fine. The only overlap between texture usage and managed memory is in the following call:

    cudaMemcpyToArrayAsync(input_t, 0, 0, out, width*height*sizeof(float),  cudaMemcpyHostToDevice);
    

    where managed memory is the source (i.e. host side) of the transfer. This is acceptable as long as the call is issued during a period when no kernels are executing (see below).

    "Another thing that I find odd is that if I remove the cudaDeviceSynchronize() in the code I always get segmentation faults."

    cudaDeviceSynchronize() is necessary after a kernel call to make the managed memory visible to the host again. I suggest you read this section of the documentation carefully:

    "In general, it is not permitted for the CPU to access any managed allocations or variables while the GPU is active. Concurrent CPU/GPU accesses, ... will cause a segmentation fault..."

    As you've indicated, the code you posted works fine. If you have other code that has unpredictable seg faults while using managed memory, I would carefully inspect the code flow (especially if you are using streams, i.e. concurrency) to make sure that the host accesses managed data only after a cudaDeviceSynchronize() has been issued, and before any subsequent kernel calls. A minimal sketch of this access pattern follows below.
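
    For illustration only, a minimal sketch of the pattern described above (the kernel, names, and sizes are placeholders of mine, not part of your code): it uses both managed allocation forms and touches managed data on the host only while no kernel is in flight.

    #include <iostream>

    __device__ __managed__ int counter;   // static managed allocation

    __global__ void touch(float* buf, int n){
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i < n) buf[i] += 1.0f;
        if(i == 0) counter = n;
    }

    int main(){
        const int n = 256;
        float* buf;
        cudaMallocManaged(&buf, n*sizeof(float));   // dynamic managed allocation

        // The host may touch managed data here: no kernel has been launched yet.
        for(int i = 0; i < n; ++i) buf[i] = 0.0f;

        touch<<<1, n>>>(buf, n);

        // Between the launch and the synchronization the host must not read or
        // write buf or counter; doing so is what produces the segmentation faults.
        cudaDeviceSynchronize();

        // After synchronization the managed data is visible to the host again.
        std::cout << counter << " " << buf[0] << "\n";

        cudaFree(buf);
        return 0;
    }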