
Why do CUDA kernels have to check `if (index < n)` before doing anything?


This is the very definition of annoying boilerplate. Clearly the kernel should never be invoked where index >= n.

OpenCL doesn't require that you do such a bounds check; it is already done for you outside of the kernel.


Solution

  • Shorter:

    CUDA has no independent definition of the global thread space that could be used to limit the total number of threads to match the problem size. Therefore this limiting, if needed, must be done in kernel code. OpenCL provides an independent definition of the global work-item space/size, and the launch mechanism uses it to ensure that no more than that many work-items are launched. If this suffices for the problem size, no further in-kernel conditioning is necessary.

    Longer:

    In CUDA, the only definition of the "global thread space" (CUDA typically calls this "the grid") that we have is given by the grid dimension (number of blocks) and the block dimension (number of threads per block) in the kernel-launch syntax, for example using typical CUDA runtime API syntax:

    kernel<<<number_of_blocks, threads_per_block,...>>>(...);
    

    The second number roughly corresponds to the local work size in OpenCL, and the product of the two numbers roughly corresponds to the global work size in OpenCL. In CUDA, there is no other way to specify the "global thread space" (corresponding to the global work size in OpenCL, i.e. the total number of work-items launched in OpenCL).

    In CUDA, then, the "global thread space" is given by the product of these 2 numbers indicated at kernel launch, and therefore we often end up in a situation where it is convenient to specify a grid size that is larger than the needed number of threads (probably determined by the problem size rather than by the grid size). The reasons for this are well covered in various forum posts, but arise fundamentally out of the granular nature of grid specification that this implies. For example, see here for a discussion of some calculation considerations.
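
    For a concrete (hedged) sketch of that granularity, using the launch syntax shown above and purely illustrative numbers:

    size_t N = 1000;                  // problem size (number of elements to process)
    size_t threads_per_block = 256;   // block dimension chosen by the programmer
    size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;  // ceiling division: 4
    // 4 blocks * 256 threads/block = 1024 threads in the grid, but only 1000 are
    // needed, so 24 threads would index out of bounds without an in-kernel check.
    kernel<<<number_of_blocks, threads_per_block>>>(..., N);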

    When the grid size is larger than the needed number of threads, it is extremely common (and often necessary, to prevent, for example, out-of-bounds indexing) to use what I refer to as a "thread check" in the kernel (1D example):

    __global__ void kernel(..., size_t N){
      size_t idx = blockIdx.x*blockDim.x+threadIdx.x;  // get globally unique thread ID
      if (idx < N) { // make sure thread will be "in-bounds" for problem space
        ...   // body of kernel
      }
    }
    

    Even if we wanted to eliminate this "boilerplate" using some automatic mechanism at the kernel launch point, we have no definition with which to do so. There is no number provided by the programmer or by the launch API(s) that the CUDA runtime could "automatically" use to limit the number of launched threads to fewer than the grid definition given by the launch configuration arguments.

    In OpenCL, however, we have a separate, independent definition of the "global thread space", namely the global_work_size parameter of clEnqueueNDRangeKernel. This argument is provided independently of any other launch parameter, so we have an "independent" definition of the "global thread space" (the global work-item space) which isn't subject to any "granular specification" necessities. The launch mechanism therefore can and does limit the global work-item space/size to be equal to this number. In situations where that suffices for the actual problem size, no further "boilerplate" conditioning of active work-items is necessary.
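
    For illustration, here is a hedged host-side sketch of such a launch (the queue, kernel object, and N are assumed to have been set up elsewhere; only the enqueue call itself is the OpenCL API):

    size_t global_work_size = N;      // global work-item space set exactly to the problem size
    cl_int err = clEnqueueNDRangeKernel(queue,              // cl_command_queue
                                        kernel,             // cl_kernel
                                        1,                  // work_dim: 1D
                                        NULL,               // global_work_offset
                                        &global_work_size,  // exactly N work-items
                                        NULL,               // local_work_size: let the runtime choose
                                        0, NULL, NULL);     // no event wait list

    (Note that if an explicit local_work_size is also supplied, OpenCL 1.x requires global_work_size to be a multiple of it, which is one situation where an in-kernel check can reappear even in OpenCL.)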

    As an aside, not really relevant to the question, OpenCL extends the "knowledge" of this global work-item "space" into the kernel API as well, via for example get_global_size() and, relatedly, get_global_id(). CUDA has no corresponding definition and therefore no corresponding kernel API. Instead, CUDA programmers will typically determine global grid dimensions as a product of the supplied built-in variables (and will typically assemble a globally unique ID using the canonical arithmetic, for the 1D case, that I have already shown in the kernel example above).
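
    To make that concrete, here is a hedged sketch of an OpenCL 1D kernel using those built-ins (the kernel name and arguments are illustrative, not from the question), with the corresponding CUDA expressions noted in comments:

    __kernel void scale(__global float *data, float factor) {
      size_t idx = get_global_id(0);   // CUDA analog: blockIdx.x*blockDim.x + threadIdx.x
      // get_global_size(0) returns the total number of work-items in dimension 0;
      // the CUDA analog is the product gridDim.x*blockDim.x (computed, not queried).
      // No "if (idx < N)" check is needed here when the host sets the global work
      // size equal to the problem size, since exactly that many work-items run.
      data[idx] = data[idx] * factor;
    }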

    My uses of "CUDA" above should primarily have CUDA C++ in view. There are some minor differences when we are talking about another CUDA language binding, such as Numba CUDA Python; however, the general idea of a lack of an independent global space definition applies there as well.