
Create arrays in shared memory without templates, as in PyOpenCL


How can I create an array in shared memory without modifying the kernel with templates, as seen in the official examples? Or is using templates the official way?

In PyOpenCL I can create an array in local memory by setting a kernel argument:

kernel.set_arg(1,numpy.uint32(a_width))

... 
KERNEL_CODE = """
__kernel void matrixMul(__local float* A_temp,...)
    { ...} """
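
For comparison, a complete PyOpenCL launch would pass the local buffer as a `pyopencl.LocalMemory` argument sized in bytes at call time. A minimal sketch, assuming an OpenCL device is available (the buffer size and empty kernel body are illustrative):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
__kernel void matrixMul(__local float* A_temp, const unsigned int a_width)
{
    /* ... use A_temp as per-work-group scratch space ... */
}
""").build()

a_width = 64
# Size the __local buffer in bytes at launch time -- no templates needed.
local_buf = cl.LocalMemory(a_width * np.float32().nbytes)
prg.matrixMul(queue, (a_width,), None, local_buf, np.uint32(a_width))
```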

Solution

  • CUDA supports dynamic shared memory allocation at kernel run time, but the mechanism is a bit different from OpenCL's. In the CUDA runtime API, a kernel that uses dynamically allocated/sized shared memory, and the launch that sizes that memory, use the following syntax:

    __global__ void kernel(...)
    {
        /* "typename" is a placeholder for the element type, e.g. float */
        extern __shared__ typename buffer[];
    
        ....
    }
    ....
    kernel <<< griddim, blockdim, sharedmem, streamID >>> (...)
    

    where sharedmem is the total number of bytes per block which will be allocated to buffer.
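
    The sharedmem argument is just a byte count computed on the host. As a hedged illustration (the helper name is mine, not from the original), the size for a per-block tile of 32-bit floats works out like this:

```python
import struct

FLOAT32_SIZE = struct.calcsize("f")  # 4 bytes

def shared_bytes(block_x, block_y, elem_size=FLOAT32_SIZE):
    """Dynamic shared memory, in bytes, for one block_x * block_y tile."""
    return block_x * block_y * elem_size

# A 16x16 tile of floats needs 16 * 16 * 4 = 1024 bytes per block.
print(shared_bytes(16, 16))  # -> 1024
```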

    In PyCUDA, the same mechanism works something like this:

    mod = SourceModule("""
        __global__ void kernel(...)
        {
            extern __shared__ typename buffer[];
    
            ....
        }
      """)
    
    func = mod.get_function("kernel")
    func.prepare(..., shared=sharedmem)
    func.prepared_call(griddim,blockdim,...)
    

    with the shared memory allocation size, in bytes, passed to the prepare method. (In more recent PyCUDA releases, prepare takes only the argument types, and the byte count is instead passed as shared_size to prepared_call.)
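
    Putting the pieces together: a minimal end-to-end sketch, assuming a recent PyCUDA where the byte count goes to prepared_call as shared_size, and assuming a CUDA-capable GPU (the doubling kernel and all names are illustrative, not from the original):

```python
import numpy as np
import pycuda.autoinit          # noqa: F401  (initializes a CUDA context)
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void doubleit(float *out, const float *in, int n)
{
    extern __shared__ float buffer[];   /* sized per launch, no template */

    int i = threadIdx.x;
    if (i < n) {
        buffer[i] = in[i] * 2.0f;       /* stage through shared memory */
        __syncthreads();
        out[i] = buffer[i];
    }
}
""")

n = 256
a_gpu = gpuarray.to_gpu(np.random.randn(n).astype(np.float32))
out_gpu = gpuarray.empty_like(a_gpu)

func = mod.get_function("doubleit")
func.prepare("PPi")
# One float of dynamic shared memory per thread: n * 4 bytes.
func.prepared_call((1, 1), (n, 1, 1), out_gpu.gpudata, a_gpu.gpudata,
                   np.int32(n), shared_size=n * np.float32().nbytes)
```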