
How does the warp loop unrolling work in Harris' Parallel Reduction tutorial?


I am following the reduction in CUDA presentation by Mark Harris. I've gotten to optimization step #5, and I am confused by the main logic of the warpReduce() function:

__device__ void warpReduce(volatile int* sdata, int tid) {
  sdata[tid] += sdata[tid + 32]; // line A
  sdata[tid] += sdata[tid + 16]; // line B
  sdata[tid] += sdata[tid + 8];
  sdata[tid] += sdata[tid + 4];
  sdata[tid] += sdata[tid + 2];
  sdata[tid] += sdata[tid + 1];
}

My question is regarding line A: why do we need sdata[tid] += sdata[tid + 32]? If this function is only called for tid < 32, shouldn't it start from sdata[tid] += sdata[tid + 16]? Otherwise, won't sdata[tid + 32] be out of range?


Solution

  • The explanation is that each thread of the warp handles two input elements in a call to warpReduce(), so one warp of 32 threads reduces 32 * 2 = 64 elements. That is why sdata[tid + 32] is a valid access: the shared-memory array holds at least 64 elements at this point.

    Have a look at slide 14 in the slide deck you linked to - you'll see the number of threads is half the number of elements they're working on.

    But I'll agree this is a bit surprising/confusing, given that in the earlier slides the addition offset s satisfies s < blockDim.x, so only blockDim.x elements are processed.