parallel-processing, cuda, atomic, nvidia

replacing blockId with a counter


Initially I used blockIdx.x in my code, but I want to remove it and instead have a global counter that each block uses in place of blockIdx.x. My code is large and hangs when I run it with large input sizes, so I thought this might help. I increment the counter atomically, but when I run the code it hangs. Can anyone take a look at my code and see if I'm doing something wrong?

__device__ int counter = 0;

__global__ void kernel(int * ginput, int * goutput)
{
  const int tid = threadIdx.x;
  const int id = threadIdx.x + blockIdx.x * blockDim.x;
  int myval = ginput[id];

  if (tid == 0) {
    atomicAdd(&counter, 1);
  }

  __syncthreads();
  if (counter == 0) {
    goutput[tid] = ...;
  }
  if (counter > 0) {
   ...
  }

}

If I use blockIdx.x instead of counter in my code, it works; I just want to replace it with the counter.


Solution

  • If you want the counter to replace your usage of blockIdx.x (i.e. you want each block to have a unique value that it reads from counter), then something like this should work:

    __device__ int counter = 0;
    
    __global__ void kernel(int * ginput, int * goutput)
    {
      const int tid = threadIdx.x;
      const int id = threadIdx.x + blockIdx.x * blockDim.x;
      __shared__ int my_block_id;
    
      // One thread per block claims a unique value; atomicAdd returns the
      // old value of counter, so each block gets a distinct id.
      if (tid == 0) {
        my_block_id = atomicAdd(&counter, 1);
      }
    
      // Make the shared id visible to every thread in the block before use.
      __syncthreads();
      if (my_block_id == 0) {
        goutput[tid] = ...;
      }
      if (my_block_id > 0) {
       ...
      }
    
    }
    

    Your method is going to be troublesome, because if you do something like this:

    if (counter > 5) ....
    

    you are potentially reading a newly updated value of counter from global memory; any number of blocks may have updated that value by then, so the behavior will be unpredictable.
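
    Also keep in mind that counter is a __device__ symbol, so it keeps its value across kernel launches. Below is a minimal host-side sketch of how you might launch the kernel and reset the counter each time; the block/thread counts and buffer sizes are placeholders, not taken from your question, and it assumes the kernel and counter above are in the same translation unit:

    #include <cuda_runtime.h>
    
    int main()
    {
      const int blocks = 64, threads = 256, n = blocks * threads;
    
      int *ginput, *goutput;
      cudaMalloc(&ginput, n * sizeof(int));
      cudaMalloc(&goutput, n * sizeof(int));
    
      // counter persists between launches; reset it to 0 before each launch
      // or the block ids handed out by atomicAdd will keep growing.
      int zero = 0;
      cudaMemcpyToSymbol(counter, &zero, sizeof(int));
    
      kernel<<<blocks, threads>>>(ginput, goutput);
      cudaDeviceSynchronize();
    
      cudaFree(ginput);
      cudaFree(goutput);
      return 0;
    }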