I've recently tested the reduction algorithm using CUDA (the one you can find, for example, at http://www.cuvilib.com/Reduction.pdf, page 16). But at the end of it I ran into trouble when not using atomic operations. So basically I compute the sum of each block and store it in a shared array. Then I copy it back to the global array x (tdx is threadIdx.x, and i is the global index).
if(i == 0){
    *sum = 0.; // initialize the global sum to 0
}
__syncthreads();
if(tdx == 0){
    x[blockIdx.x] = s_x[tdx]; // write this block's shared sum to global memory
}
__syncthreads();
Then I want to sum the first elements of x (as many as there are blocks). With the atomic operation it works fine (same result as on the CPU); however, with the commented-out line below it does not work and often yields "nan":
if(i == 0){
    for(int k = 0; k < gridDim.x; k++){
        atomicAdd(sum, x[k]); // works fine
        //sum[0] += x[k]; // or *sum += x[k]; // does not work, often yields nan
    }
}
Now, in fact, I use atomicAdd directly to sum the shared sums, but I would like to understand why the version above does not work. An atomic add seems rather pointless when the operation is restricted to a single thread, and the plain sum should work fine!
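For reference, what I do now instead is have thread 0 of each block atomically add its block's partial sum straight into the global total, something along these lines (a sketch, using the same names as above):

if(tdx == 0){
    // assumes *sum was zeroed beforehand, e.g. with cudaMemset
    atomicAdd(sum, s_x[0]); // each block adds its partial shared sum to the global total
}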
__syncthreads() only synchronizes threads within the same block, not across different blocks, and CUDA offers no safe synchronization mechanism across blocks within a single kernel launch.
The incorrect result is due to a synchronization problem. The operands x[k] are the outcomes of the computations from different blocks: x[0] is the result from block 0, x[1] is the result from block 1, etc. Thread 0 could start adding them up before some blocks have actually finished their computations.
You should put the second code snippet in a different kernel, so that synchronization between the two steps is enforced by the kernel boundary, and the line sum[0] += x[k]; then works.
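For illustration, a minimal sketch of that split (the kernel and parameter names here are made up, not taken from your code): the first kernel is your existing reduction that fills x[blockIdx.x], and a second, separate kernel does the final sum. Kernels launched on the same stream execute in order, so every per-block result is written before the second kernel reads it:

__global__ void finalSum(const float *x, float *sum, int numBlocks)
{
    // Runs after the reduction kernel has completed, so every x[k] is final.
    float total = 0.f;
    for(int k = 0; k < numBlocks; k++){
        total += x[k]; // plain addition is safe here, no atomics needed
    }
    *sum = total;
}

// Host side:
// reduceKernel<<<numBlocks, blockSize>>>(...); // fills x[0..numBlocks-1]
// finalSum<<<1, 1>>>(x, sum, numBlocks);       // single thread sums the per-block results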