Tags: cuda, pragma, loop-unrolling

Determining the optimal value for #pragma unroll N in CUDA


I understand how #pragma unroll works, but if I have the following example:

__global__ void
test_kernel( const float* B, const float* C, float* A_out)
{
  int j = threadIdx.x + blockIdx.x * blockDim.x;
  if (j < array_size) {
     #pragma unroll
     for (int i = 0; i < LIMIT; i++) {
       A_out[i] = B[i] + C[i];
     }
  }
}

I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x threads and y blocks. LIMIT can be anywhere from 2 to 1<<20. Since one million seems like a very big number for this variable (one million unrolled iterations will cause register pressure, and I am not sure the compiler will even perform that unroll), what is a "fair" number, if any? And how do I determine that limit?


Solution

  • Your example kernel is completely serial and not in any way a useful real-world use case for loop unrolling, but let's restrict ourselves to the question of how much loop unrolling the compiler will perform.

    Here is a compilable version of your kernel with a bit of template decoration:

    template<int LIMIT>
    __global__ void
    test_kernel( const float* B, const float* C, float* A_out, int array_size)
    {
      int j = threadIdx.x + blockIdx.x * blockDim.x;
      if (j < array_size) {
         #pragma unroll
         for (int i = 0; i < LIMIT; i++) {
           A_out[i] = B[i] + C[i];
         }
      }
    }
    
    template __global__ void test_kernel<4>(const float*, const float*, float*, int);
    template __global__ void test_kernel<64>(const float*, const float*, float*, int);
    template __global__ void test_kernel<256>(const float*, const float*, float*, int);
    template __global__ void test_kernel<1024>(const float*, const float*, float*, int);
    template __global__ void test_kernel<4096>(const float*, const float*, float*, int);
    template __global__ void test_kernel<8192>(const float*, const float*, float*, int);
    

    You can compile this to PTX and see for yourself that (at least with the CUDA 7 release compiler and the default compute capability 2.0 target architecture), the kernels with up to LIMIT=4096 are fully unrolled. The LIMIT=8192 case is not unrolled. If you have more patience than I have, you can probably play around with the templating to find the exact compiler limit for this code, although I doubt that is particularly instructive to know.
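    As a sketch of that check (the file name `test_kernel.cu` and the grep pattern are my own assumptions, and recent toolkits no longer accept sm_20, so adjust -arch to your target), you could dump the PTX and count the global loads per kernel to read off the unroll factor:

    ```shell
    # Emit PTX only; a fully unrolled loop body appears as straight-line code
    # with no backward branch.
    nvcc -arch=sm_20 -ptx test_kernel.cu -o test_kernel.ptx

    # Count ld.global.f32 instructions: roughly 2*LIMIT per fully unrolled
    # kernel (one load each from B and C per iteration). A kernel that kept
    # its loop will show only a handful, plus a bra back-edge.
    grep -c "ld.global.f32" test_kernel.ptx
    ```

    Comparing that count across the template instantiations makes the point at which the compiler gives up on unrolling obvious.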

    You can also confirm via the compiler's resource-usage output that all of the heavily unrolled versions use the same number of registers (because of the trivial nature of your kernel).
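    One way to get that per-kernel register report (again a sketch; the file name is assumed and the exact output wording varies by CUDA version) is to pass the verbose flag through to ptxas:

    ```shell
    # -Xptxas -v makes the ptxas assembler print register, shared-memory and
    # spill statistics for every kernel it compiles, e.g.:
    #   ptxas info : Used 8 registers, 344 bytes cmem[0]
    nvcc -arch=sm_20 -Xptxas -v -c test_kernel.cu -o test_kernel.o
    ```

    If the register counts stayed identical across the instantiations, the unrolling cost you nothing in occupancy here; for a kernel with real per-iteration state, that is where you would see the register pressure the question worries about.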