Search code examples
cudagpuopenclnvidiagpu-constant-memory

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?


I have a buffer (array) on the host that should be resided in the constant memory region of the device (in this case, an NVIDIA GPU).

So, I have two questions:

  1. How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)

  2. How can I initialize (populate) those arrays from values that are computed at the run time on the host?

I searched the web for this but there is no concise document documenting this. I would appreciate it if provided examples would be in both OpenCL and CUDA. The example for OpenCL is more important to me than CUDA.


Solution

  • For cuda, I use driver API and NVRTC and create kernel string with a global constant array like this:

    auto kernel = R"(
    ..
    __constant__ @@Type@@ buffer[@@SIZE@@]={
       @@elm@@
    };
    ..
    __global__ void test(int * input)
    {   }
    
    )";   
    

    then replace @@-pattern words with size and element value information in run-time and compile like this:

    __constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
    

    So, it is run-time for the host, compile-time for the device. Downside is that the kernel string gets too big, has less readability and connecting classes needs explicitly linking (as if you are compiling a side C++ project) other compilation units. But for simple calculations with only your own implementations (no host-definitions used directly), it is same as runtime API.

    Since large strings require extra parsing time, you can cache the ptx intermediate data and also cache the binary generated from ptx. Then you can check if kernel string has changed and needs to be re-compiled.

    Are you sure just __constant__ worths the effort? Do you have some benchmark results to show that actually improves performance? (premature optimization is source of all evil). Perhaps your algorithm works with register-tiling and the source of data does not matter?