I am following the "Optimizing Parallel Reduction in CUDA" presentation by Mark Harris. I've gotten to optimization step #5 and I am confused by the main logic of the warpReduce() function:
__device__ void warpReduce(volatile int* sdata, int tid) {
    sdata[tid] += sdata[tid + 32]; // line A
    sdata[tid] += sdata[tid + 16]; // line B
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
My question is about line A: why do we need sdata[tid] += sdata[tid + 32]? If tid < 32, shouldn't the sequence start at sdata[tid] += sdata[tid + 16]? Otherwise, isn't sdata[tid + 32] out of range?
The explanation is that, in a call to the warpReduce() function, each of the warp's 32 threads handles two input elements, so one warp covers 32 * 2 = 64 elements.
Have a look at slide 14 in the slide deck you linked to - you'll see the number of threads is half the number of elements they're working on.
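For concreteness, here is roughly what the surrounding kernel looks like in that deck. This is a sketch reconstructed from the slides, not copied verbatim; the kernel name reduce5 and the launch assumptions are mine. The key point is that sdata holds blockDim.x partial sums and blockDim.x is at least 64, so line A's access to sdata[tid + 32] stays in range:

// Sketch of the step-#5 kernel that calls warpReduce() (hypothetical name
// reduce5; assumes blockDim.x >= 64 and a launch with blockDim.x * sizeof(int)
// bytes of dynamic shared memory). warpReduce() is as defined above.
__global__ void reduce5(int* g_idata, int* g_odata) {
    extern __shared__ int sdata[]; // blockDim.x partial sums

    unsigned int tid = threadIdx.x;
    // First add during global load (optimization #4): each thread sums
    // two input elements, so a block consumes blockDim.x * 2 inputs.
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Tree reduction down to 64 remaining elements (s stops above 32;
    // the final steps are handled by warpReduce()).
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // 32 threads reduce the last 64 elements; line A folds
    // sdata[32..63] onto sdata[0..31], hence the tid + 32 access.
    if (tid < 32) warpReduce(sdata, tid);

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

With blockDim.x = 128, for example, the loop runs only for s = 64, leaving 64 partial sums in sdata[0..63]; line A is exactly the step that folds the upper half of those onto the lower half.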
But I'll agree this is a bit surprising/confusing, given how in earlier slides the addition offset s always satisfies s < blockDim.x, so that only blockDim.x elements are processed.
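For comparison, this is roughly the loop from those earlier slides (again a sketch from the deck, not a verbatim copy). Since s never exceeds blockDim.x / 2, a thread with tid < s reads index tid + s < 2s <= blockDim.x, so no access ever reaches past the block's own elements:

// Tree reduction from the earlier slides (sketch): every step halves s,
// and tid < s implies tid + s < 2s <= blockDim.x, so all accesses stay
// within the block's blockDim.x shared-memory elements.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads(); // all threads must sync before the next step
}

Note that the same bound actually holds for line A too (tid < 32 gives tid + 32 <= 63 < blockDim.x); the difference is just that the 32 active threads now each own two of the remaining 64 elements.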