I have a kernel in my CUDA code where I want a bunch of threads to do a bunch of computations on some piece of shared memory (because it's much faster than doing so on global memory), and then write the result to global memory (so I can use it in later kernels). The kernel looks something like this:
__global__ void calc(float * globalmem)
{
__shared__ float sharemem; //initialize shared memory
sharemem = 0; //set it to initial value
__syncthreads();
//do various calculations on the shared memory
//for example I use atomicAdd() to add each thread's
//result to sharedmem...
__syncthreads();
*globalmem = sharedmem;//write shared memory to global memory
}
The fact that every single thread is writing the data out from shared to global memory, when I really only need to write it out once, feels fishy to me. I also get the same feeling from the fact that every thread initializes the shared memory to zero at the start of the code. Is there a faster way to do this than my current implementation?
At the warp level, there's probably not much performance difference between doing a redundant read or write vs. having a single thread do it.
However I would expect a possibly measurable performance difference by having multiple warps in a threadblock do the redundant read or write (vs. a single thread).
It should be sufficient to address these concerns by having a single thread do the read or write, rather than redundantly:
__global__ void calc(float * globalmem)
{
__shared__ float sharemem; //initialize shared memory
if (!threadIdx.x) sharemem = 0; //set it to initial value
__syncthreads();
//do various calculations on the shared memory
//for example I use atomicAdd() to add each thread's
//result to sharedmem...
__syncthreads();
if (!threadIdx.x) *globalmem = sharemem;//write shared memory to global memory
}
Although you didn't ask about it, using atomics within a threadblock on shared memory may possibly be replaceable (for possibly better performance) by a shared memory reduction method.