Are needless write operations in multi-thread kernels in CUDA inefficient?

I have a kernel in my CUDA code where I want a bunch of threads to do a bunch of computations on some piece of shared memory (because it's much faster than doing so on global memory), and then write the result to global memory (so I can use it in later kernels). The kernel looks something like this:

__global__ void calc(float * globalmem)
{
    __shared__ float sharemem; //initialize shared memory
    sharemem = 0; //set it to initial value
    __syncthreads();

   //do various calculations on the shared memory
   //for example I use atomicAdd() to add each thread's
   //result to sharedmem...

   __syncthreads();
   *globalmem = sharedmem;//write shared memory to global memory
}

The fact that every single thread is writing the data out from shared to global memory, when I really only need to write it out once, feels fishy to me. I also get the same feeling from the fact that every thread initializes the shared memory to zero at the start of the code. Is there a faster way to do this than my current implementation?

Solution

At the warp level, there's probably not much performance difference between doing a redundant read or write vs. having a single thread do it.

However I would expect a possibly measurable performance difference by having multiple warps in a threadblock do the redundant read or write (vs. a single thread).

It should be sufficient to address these concerns by having a single thread do the read or write, rather than redundantly:

__global__ void calc(float * globalmem)
{
    __shared__ float sharemem; //initialize shared memory
    if (!threadIdx.x) sharemem = 0; //set it to initial value
    __syncthreads();

   //do various calculations on the shared memory
   //for example I use atomicAdd() to add each thread's
   //result to sharedmem...

   __syncthreads();
   if (!threadIdx.x) *globalmem = sharemem;//write shared memory to global memory
}

Although you didn't ask about it, using atomics within a threadblock on shared memory may possibly be replaceable (for possibly better performance) by a shared memory reduction method.