class · oop · struct · cuda · parameter-passing

Behaviour of passing struct as a parameter to a CUDA kernel


I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:

struct protoarray {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float* data;
};

I create two variables of type protoarray, dynamically allocate space for data via malloc on the host side and cudaMalloc on the device side, and update dim1, dim2 and dim3 to reflect the size of the array I want the struct to represent. I read in this thread that the struct should be passed by copy. So this is what I do in my kernel:

__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something 
        }
    }
}

The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel, with a breakpoint set at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere near large enough to cause overflow, yet this value copies correctly into dim1 as 2, which means the for loops execute as I intended. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are built around 32-bit cores?

Generally, how safe is it to pass structs and classes into kernels as arguments? Is it bad practice that should be avoided at all costs? I imagine that passing pointers to classes to kernels is difficult if they contain members which point to dynamically allocated memory, and that such objects should be very lightweight if I want to pass them by value.


Solution

  • This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.

    The struct is passed by copy, so all its contents are copied into shared memory of each block.

    Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored in shared memory (which is block-specific).

    When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).

    Generally, how safe is it to pass struct and class into kernels as arguments

    My personal rule of thumb on this matter is: If the struct/class...

    • is trivially copy-constructible; and
    • all of its members are defined for both the host and the device side, or at least designed with GPU use in mind;

    then it should be safe to pass to a kernel.
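
    The first condition can be checked at compile time. As a sketch in plain C++ (using the `protoarray` type from the question), a `static_assert` will reject the type as a kernel argument the moment someone adds a member that breaks trivial copyability:

    ```cpp
    #include <cstddef>
    #include <cstdio>
    #include <type_traits>

    // The struct from the question: plain data members plus a raw pointer,
    // so the type is trivially copyable.
    struct protoarray {
        size_t dim1;
        size_t dim2;
        size_t dim3;
        float* data;
    };

    // Fails to compile if someone later adds e.g. a std::vector member,
    // which would make the type unsafe to pass to a kernel by value.
    static_assert(std::is_trivially_copyable<protoarray>::value,
                  "protoarray must be trivially copyable to be a kernel argument");

    int main() {
        std::printf("trivially copyable: %s\n",
                    std::is_trivially_copyable<protoarray>::value ? "yes" : "no");
        return 0;
    }
    ```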

    (This is a bit more than a rule of thumb, actually: I enforce the first part of it in the Modern C++ wrappers for CUDA, in the innermost kernel-launch code.)

    passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?

    No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind, so I'd be wary of using non-trivial classes without a lot of scrutiny.

    I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory

    Yes, this can be problematic. However, if you use cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged(), then this should "just work": the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed.
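
    A minimal sketch of the plain-runtime-API variant, reusing the question's `protoarray` and `kernel` (the launch configuration `blocks, threads` is a placeholder):

    ```cuda
    // Allocate the data buffer in managed (unified) memory: the same
    // pointer is then valid on both host and device, and pages migrate
    // between them on access.
    protoarray arr;
    arr.dim1 = 2; arr.dim2 = 3; arr.dim3 = 4;
    cudaMallocManaged(&arr.data,
                      arr.dim1 * arr.dim2 * arr.dim3 * sizeof(float));

    // The struct itself is still passed by value; only arr.data is shared.
    kernel<<<blocks, threads>>>(state, arr);
    cudaDeviceSynchronize();  // ensure the GPU is done before host access

    cudaFree(arr.data);
    ```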

    and that they should be very lightweight if I want to pass [objects to kernels] by value.

    Yes, because each and every thread has to read each argument from constant memory before it can use it. And while constant memory allows this to happen relatively quickly, it's still overhead that you want to minimize.

    Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
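
    Concretely, the two options look like this (a sketch using the question's `protoarray`; sizes assume 64-bit host code):

    ```cuda
    // By value: the whole struct (32 bytes here: three size_t's plus a
    // pointer) is copied into constant memory at launch.
    __global__ void kernel_by_value(protoarray arr);

    // By pointer: only the 8-byte pointer is copied; arr must then point
    // to device-accessible memory (cudaMalloc'd or managed), not to a
    // variable on the host stack.
    __global__ void kernel_by_pointer(const protoarray* arr);
    ```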