Search code examples
cudanvidiapragmaloop-unrolling

What does #pragma unroll do exactly? Does it affect the number of threads?


I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique

__global__ void kernel(float *b, int size)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
 #pragma unroll
    for(int i=0;i<size;i++)
        b[i]=i;
}

Above is my kernel function. In main I call it like below

int main()
{
    float * a; //host array
    float * b; //device array
    int size=100;

    a=(float*)malloc(size*sizeof(float));
    cudaMalloc((float**)&b,size);
    cudaMemcpy(b, a, size, cudaMemcpyHostToDevice);

    kernel<<<1,size>>>(b,size); //size=100

    cudaMemcpy(a, b, size, cudaMemcpyDeviceToHost);

    for(int i=0;i<size;i++)
        cout<<a[i]<<"\t";

    _getch();

    return 0;
}

Does it mean I have size*size=10000 threads running to execute the program? Are 100 of them created when loop is unrolled?


Solution

  • No. It means you have called a CUDA kernel with one block and that one block has 100 active threads. You're passing size as the second function parameter to your kernel. In your kernel each of those 100 threads executes the for loop 100 times.

    #pragma unroll is a compiler optimization that can, for example, replace a piece of code like

    for ( int i = 0; i < 5; i++ )
        b[i] = i;
    

    with

    b[0] = 0;
    b[1] = 1;
    b[2] = 2;
    b[3] = 3;
    b[4] = 4;
    

    by putting #pragma unroll directive right before the loop. The good thing about the unrolled version is that it involves less processing load for the processor. In case of for loop version, the processing, in addition to assigning each i to b[i], involves i initialization, evaluating i<5 for 6 times, and incrementing i for 5 times. While in the second case, it only involves filing up b array content (perhaps plus int i=5; if i is used later). Another benefit of loop unrolling is the enhancement of Instruction-Level Parallelism (ILP). In the unrolled version, there would possibly be more operations for the processor to push into processing pipeline without being worried about the for loop condition in every iteration.

    Posts like this explain that runtime loop unrolling cannot happen for CUDA. In your case CUDA compiler doesn't have any clues that size is going to be 100 so compile-time loop unrolling will not occur, and so if you force unrolling, you may end up hurting the performance.

    If you are sure that the size is 100 for all executions, you can unroll your loop like below:

    #pragma unroll
    for(int i=0;i<SIZE;i++)  //or simply for(int i=0;i<100;i++)
        b[i]=i;
    

    in which SIZE is known in compile time with #define SIZE 100.

    I also suggest you to have proper CUDA error checking in your code (explained here).