Tags: parameters, cuda, gpu, nvcc

Does the CUDA compiler optimize the kernel based on the passed parameters?


I have a simple CUDA kernel in which each thread counts the number of A's in a 1000-byte fragment of a very large string. The database is laid out so that memory accesses are coalesced. After the kernel returns, my main function copies the device array results to a host array for further analysis.

__global__ void kernel(unsigned int jobs_todo, char* database, float* results ) {

  unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
  float A = 0; int i; char ch;
  if(id < jobs_todo) {
    for(i = 0; i < 1000; i += 1){
      ch = database[jobs_todo*i + id];
      if(ch == 'A') A++;
    }
    results[id] = A;
  }
}

The kernel runs fine. However, if I replace results[id]=A with something trivial like results[id]=10, or just comment out that line, it runs much faster (about 10 times) and uses far fewer registers, as reported by --ptxas-options=-v. Of course, the kernel is useless without that line. Does the CUDA compiler work this out by looking at the passed parameters, and so choose to do nothing?


Solution

  • What you are seeing is the result of compiler optimisation. The compiler prunes "dead" code, that is, code whose results never contribute to a memory write or any other observable effect. So your kernel

    __global__ void kernel(unsigned int jobs_todo, char* database, float* results ) {
    
      unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
      float A = 0; int i; char ch;
      if(id < jobs_todo) {
        for(i = 0; i < 1000; i += 1){
          ch = database[jobs_todo*i + id];
          if(ch == 'A') A++;
        }
        results[id]=10;
      }
    }
    

    is effectively optimised to

    __global__ void kernel(unsigned int jobs_todo, char* database, float* results ) {
    
      unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
      // The bounds check still guards the store, so it survives the optimisation.
      if(id < jobs_todo) results[id]=10;
    }
    

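    The reduced code performs no loads from database and has no loop, so its register footprint and execution time are much lower than those of your full code. This decision is made at compile time from the kernel source alone; the argument values passed at launch play no part in it. You can confirm this by compiling your code to PTX (nvcc -ptx) and inspecting the emitted code.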
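
    As a rough way to check this yourself, the two variants can be placed side by side in one file and compiled to PTX. The sketch below assumes a file called dce_demo.cu and the kernel names count_stored and count_discarded; all three names are made up for this illustration and do not come from the question.

    ```cuda
    // dce_demo.cu (hypothetical file name, used only for this illustration).
    // Emit PTX for inspection with:
    //   nvcc -ptx dce_demo.cu -o dce_demo.ptx

    // Variant 1: the counted value is stored, so the loop and its loads
    // feed a memory write and cannot be removed.
    __global__ void count_stored(unsigned int jobs_todo,
                                 const char* database,
                                 float* results)
    {
        unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id < jobs_todo) {
            float A = 0.0f;
            for (int i = 0; i < 1000; ++i) {
                if (database[jobs_todo * i + id] == 'A') A += 1.0f;
            }
            results[id] = A;
        }
    }

    // Variant 2: a constant is stored, so nothing observable depends on A
    // and the compiler is free to discard the loop entirely.
    __global__ void count_discarded(unsigned int jobs_todo,
                                    const char* database,
                                    float* results)
    {
        unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id < jobs_todo) {
            float A = 0.0f;
            for (int i = 0; i < 1000; ++i) {
                if (database[jobs_todo * i + id] == 'A') A += 1.0f;
            }
            results[id] = 10.0f;
        }
    }
    ```

    In the resulting PTX you would expect count_stored to contain global loads (ld.global) from database, while count_discarded should be left with little more than the index computation, the bounds check and a single st.global. The exact output depends on the toolkit version and flags. In particular, building with -G turns off device code optimisation, so the loop would then remain in both kernels. Because ptxas optimises the PTX further, running cuobjdump -sass on the compiled binary shows the final machine code if you want to see what actually executes.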