cuda, gpu, gpgpu, gpu-shared-memory

Are needless write operations in multi-thread kernels in CUDA inefficient?


I have a kernel in my CUDA code where I want a bunch of threads to do a bunch of computations on some piece of shared memory (because it's much faster than doing so on global memory), and then write the result to global memory (so I can use it in later kernels). The kernel looks something like this:

__global__ void calc(float * globalmem)
{
    __shared__ float sharedmem; //declare shared memory
    sharedmem = 0; //set it to initial value
    __syncthreads();

    //do various calculations on the shared memory
    //for example I use atomicAdd() to add each thread's
    //result to sharedmem...

    __syncthreads();
    *globalmem = sharedmem; //write shared memory to global memory
}

The fact that every single thread is writing the data out from shared to global memory, when I really only need to write it out once, feels fishy to me. I also get the same feeling from the fact that every thread initializes the shared memory to zero at the start of the code. Is there a faster way to do this than my current implementation?


Solution

  • At the warp level, there's probably not much performance difference between doing a redundant read or write vs. having a single thread do it.

    However, I would expect a possibly measurable performance difference when multiple warps in a threadblock do the redundant read or write (vs. a single thread).

    It should be sufficient to address these concerns by having a single thread do the read or write, rather than redundantly:

    __global__ void calc(float * globalmem)
    {
        __shared__ float sharedmem; //declare shared memory
        if (!threadIdx.x) sharedmem = 0; //only thread 0 sets the initial value
        __syncthreads();

        //do various calculations on the shared memory
        //for example I use atomicAdd() to add each thread's
        //result to sharedmem...

        __syncthreads();
        if (!threadIdx.x) *globalmem = sharedmem; //only thread 0 writes shared memory to global memory
    }
    

    Although you didn't ask about it, using atomics on shared memory within a threadblock could possibly be replaced (perhaps with better performance) by a shared memory reduction method, such as the sketch below.
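
    As a rough illustration only, here is a minimal sketch of what such a shared memory reduction could look like. It assumes a fixed, power-of-two block size of 256 threads; the BLOCKSIZE constant, the calc_reduce name, and the per-thread placeholder value myresult are illustrative and not part of the original code:

    __global__ void calc_reduce(float * globalmem)
    {
        const int BLOCKSIZE = 256;           //assumed block size; must match the launch configuration
        __shared__ float partial[BLOCKSIZE]; //one shared memory slot per thread

        float myresult = 1.0f;               //placeholder for each thread's computed result
        partial[threadIdx.x] = myresult;
        __syncthreads();

        //tree reduction: halve the number of active threads each step
        for (int stride = BLOCKSIZE / 2; stride > 0; stride >>= 1)
        {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }

        if (!threadIdx.x) *globalmem = partial[0]; //thread 0 writes the block-wide sum once
    }

    This avoids shared memory atomics entirely, at the cost of one shared memory element per thread instead of a single element.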