How are global atomic operations implemented in Kepler? I got less performance using gmem rather than using atomics

I would like to know the implementation of global atomics in Kepler.

see this piece of code:

1. if (threadIdx.x < workers) {
2.    temp = atomicAdd(dst, temp + rangeOffset);
3.    if (isLastPartialCalc(temp)) {                            
4.        atomicAdd(dst,-300000.0f);
5.    }
6. }

if I change line 4 for this:

*dst -= 300000.0f;

The performance is lower! The change is safe, since no more threads will write on this value (the output are the same).

kernel using atomic: ~883us kernel using gmem directly: ~903us

I have run several times and I always get this ~20us penalty for the change

UPDATE It seems the store without using the atomic always produces a miss in L2 whilst the atomic version always produces a hit ... so I guess trying to write to some location which was flagged (or something) with "atomic" is not allowed in L2 and it makes another request to gmem

Solution

This cache line update is apparently more costly (in your particular code) than just the second global atomic access.

A single global atomic access from a single SM to global memory on Kepler GK110 (e.g. K20) is actually quite fast.

As indicated in the Kepler white paper, Kepler has improved speed of global atomics as compared to Fermi.

Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock.