I've recently tested the reduction algorithm using CUDA (the one you can find, for example, at http://www.cuvilib.com/Reduction.pdf, page 16). But at the end of it I ran into trouble when not using atomic operations. So basically I compute the sum of each block and store it in a shared array. Then I copy it back to the global array x (tdx is threadIdx.x, and i is the global index).
if(i == 0){
    *sum = 0.; // initialize the global sum to 0
}
__syncthreads();
if(tdx == 0){
    x[blockIdx.x] = s_x[tdx]; // write this block's shared sum to global memory
}
__syncthreads();
Then I want to sum the first elements of x (as many as there are blocks). With the atomic operation it works fine (same result as on the CPU); however, with the commented-out line below it does not work and often yields "nan":
if(i == 0){
    for(int k = 0; k < gridDim.x; k++){
        atomicAdd(sum, x[k]); // works fine
        //sum[0] += x[k]; // or *sum += x[k]; // does not work, often yields nan
    }
}
Now, in fact, I use atomicAdd directly to sum the shared sums, but I would like to understand why the version above does not work. An atomic add seems rather pointless when the operation is restricted to a single thread, and the plain sum should work fine!
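For reference, what I do now instead is have thread 0 of each block atomically add its block's partial sum straight into the global total, something along these lines (a sketch, using the same names as above):

if(tdx == 0){
    // assumes *sum was zeroed beforehand, e.g. with cudaMemset
    atomicAdd(sum, s_x[0]); // each block adds its partial shared sum to the global total
}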
__syncthreads() only synchronizes threads within the same block, not across different blocks, and CUDA offers no safe synchronization mechanism across blocks within a single kernel launch.
The incorrect result is due to a synchronization problem. The operands x[k] are the outcomes of the computations from different blocks: x[0] is the result from block 0, x[1] is the result from block 1, etc. Thread 0 could start adding them up before some blocks have actually finished their computations.
You should put the second code snippet in a different kernel, so that synchronization between the two steps is enforced by the kernel boundary, and the line sum[0] += x[k]; then works.
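For illustration, a minimal sketch of that split (the kernel and parameter names here are made up, not taken from your code): the first kernel is your existing reduction that fills x[blockIdx.x], and a second, separate kernel does the final sum. Kernels launched on the same stream execute in order, so every per-block result is written before the second kernel reads it:

__global__ void finalSum(const float *x, float *sum, int numBlocks)
{
    // Runs after the reduction kernel has completed, so every x[k] is final.
    float total = 0.f;
    for(int k = 0; k < numBlocks; k++){
        total += x[k]; // plain addition is safe here, no atomics needed
    }
    *sum = total;
}

// Host side:
// reduceKernel<<<numBlocks, blockSize>>>(...); // fills x[0..numBlocks-1]
// finalSum<<<1, 1>>>(x, sum, numBlocks);       // single thread sums the per-block results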