Search code examples
cudameasurement

Cuda measurement of loop


I launch a very simple kernel <<<1,512>>> on a CUDA Fermi GPU.

__global__ void kernel(){
int x1,x2;

x1=5;
x2=1;

for (int k=0;k<=1000000;k++)
  {
   x1+=x2;

  }
}

The kernel is very simple, it does 10^6 additions and does not transfer anything back to global memory. The result is correct, i.e. after the loop x1 (in all its 512 thread instances) contains 10^6 + 5

I am trying to measure the execution time of the kernel. using both visual studio parallel nsight and nvvp. Nsight measures 2.5 microseconds and nvvp measures 4 microseconds.

The issue is the following: I may increase largely the size of the loop eg to 10^8 and the time remains constant. Same if I decrease the loop size a lot. Why does this happen?

Please note that if I use shared memory or global memory inside the loop, the measurements reflect the work being performed (i.e. there is proportionality).


Solution

  • As noted, CUDA compiler optimisation is very aggressive at removing dead code. Because x2 doesn't participate in a value which is written to memory, it and the loop can be removed. The compiler will also pre-calculate any results which can be deduced at compile time, so if all the constants in the loop are known to the compiler, it can compute the final result and replace it with a constant.

    To get around both of these problems, rewrite your code like this:

    __global__ 
    void kernel(int *out, int x0, bool flag)
    {
        int x1 = x0, x2 = 1;
    
        for (int k=0; k<=1000000; k++) {
           x1+=x2;
        }
    
        if (flag) out[threadIdx.x + blockIdx.x*blockDim.x] = x1;
    }
    

    and then run it like this:

    kernel<<<1,512>>>((int *)0, 5, false);
    

    By passing the initial value of x1 as an argument to the kernel, you ensure that the loop result isn't available to the compiler. The flag makes the memory store conditional, and then memory store makes the whole calculation unsafe to remove. As long as the flag is set to false at runtime, there is no store performed, so that doesn't effect the timing of the loop.