
How does the warp loop unrolling work in Harris' Parallel Reduction tutorial?


I am following the reduction in CUDA presentation by Mark Harris. I've gotten to optimization step #5, and I am confused by the main logic of the warpReduce() function:

__device__ void warpReduce(volatile int* sdata, int tid) {
  sdata[tid] += sdata[tid + 32]; // line A
  sdata[tid] += sdata[tid + 16]; // line B
  sdata[tid] += sdata[tid + 8];
  sdata[tid] += sdata[tid + 4];
  sdata[tid] += sdata[tid + 2];
  sdata[tid] += sdata[tid + 1];
}

My question is regarding line A: why do we need sdata[tid] += sdata[tid + 32]? If this function is only called for tid < 32, shouldn't it start from sdata[tid] += sdata[tid + 16]? Otherwise, won't sdata[tid + 32] be out of range?


Solution

  • The explanation is that each thread of the warp handles two input elements in a call to warpReduce(), so one warp of 32 threads reduces 32 * 2 = 64 elements. That is why sdata[tid + 32] is a valid access: the shared-memory array holds at least 64 elements at this point.

    Have a look at slide 14 in the slide deck you linked to - you'll see the number of threads is half the number of elements they're working on.

    But I'll agree this is a bit surprising/confusing, given that in the earlier slides the addition offset s satisfies s < blockDim.x, so only blockDim.x elements are processed.