
CUDA reduction optimizations


I am trying to implement all of the optimizations shown in Nvidia Reduction. I have implemented the first four parts, but I am stuck at part #5 (slide 22).

I cannot see why the provided code can work without any __syncthreads(), since the threads access the same shared memory locations when writing the output.

Moreover, the slide says that the code won't work if the variables are not declared volatile. How does volatile help here? And if I don't want to call the kernel, what is the best way to program it?

I am also including the code here for reference.

__device__ void warpReduce(volatile int* sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

if (tid < 32) warpReduce(sdata, tid);
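
For context, here is how I believe these fragments fit into the full kernel from the slides. The kernel signature and the names g_idata, g_odata, and n are my guesses based on the presentation, and I assume blockDim.x is a power of two and at least 64:

__global__ void reduce(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // First add during load: each thread sums two elements from global memory.
    sdata[tid] = (i < n ? g_idata[i] : 0) +
                 (i + blockDim.x < n ? g_idata[i + blockDim.x] : 0);
    __syncthreads();

    // Tree reduction in shared memory down to the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final 32 values reduced without __syncthreads() -- this is the part I don't understand.
    if (tid < 32) warpReduce(sdata, tid);

    // Thread 0 writes this block's partial sum.
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}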

Thanks in advance for your help. Please comment if further info is needed.


Solution

  • The code relies on what is known as warp-synchronous programming. It used to be common practice to skip __syncthreads() for threads within a single warp, on the assumption that the 32 threads of a warp execute in lockstep. However, this behaviour is undocumented, and NVIDIA now strongly discourages writing code that relies on it.

    From the Kepler tuning guide:

    The absence of an explicit synchronization in a program where different threads communicate via memory constitutes a data race condition or synchronization error. Warp-synchronous programs are unsafe and easily broken by evolutionary improvements to the optimization strategies used by the CUDA compiler toolchain.

    The examples that you mention are included in the samples that ship with the CUDA toolkit. If you look at a recent version, you will find that this part of the reduction is now implemented with warp shuffle operations for compute capability >= 3.0 and with __syncthreads() for older devices, as you would expect. In older samples (e.g. CUDA toolkit 6.0) it was still implemented with the warp-synchronous technique. A rough sketch of the warp-shuffle approach is included at the end of this answer.

    If you still want to learn about warp-synchronous programming, I recommend this answer.
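
    For illustration, here is a minimal sketch of what the warp-shuffle variant of the final stage might look like. The helper name warpReduceSum is my own and the actual toolkit sample is organized differently; __shfl_down_sync() is the CUDA 9+ spelling, while older toolkits used __shfl_down():

    __device__ int warpReduceSum(int val) {
        // Each step adds the value held by the lane 'offset' positions higher
        // in the warp; after five steps lane 0 holds the sum of all 32 lanes.
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }

    // Inside the reduction kernel, the last-warp stage becomes roughly:
    if (tid < 32) {
        int val = sdata[tid] + sdata[tid + 32];  // fold the upper half in first
        val = warpReduceSum(val);                // registers only, no shared memory
        if (tid == 0) g_odata[blockIdx.x] = val;
    }

    Because each thread keeps its partial sum in a register and the shuffle intrinsic exchanges values directly between lanes, this final stage needs neither shared memory traffic nor the volatile qualifier.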