I am learning about unrolling loops to optimize kernel computation.
This is a code snippet from the book Professional CUDA C Programming:
if (idx + 4 * blockDim.x <= n)
{
    int a1 = g_idata[idx];
    int a2 = g_idata[idx + blockDim.x];
    int a3 = g_idata[idx + 2 * blockDim.x];
    int a4 = g_idata[idx + 3 * blockDim.x];
    tmpSum = a1 + a2 + a3 + a4;
}
In my understanding, each thread works on 4 data blocks and processes a single element from each data block.
So, when we launch the kernel, the configuration is changed, compared with the kernel without unrolling, to a quarter of the grid: reduceSmemUnroll<<<grid.x / 4, block>>>.
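For reference, here is a minimal sketch of how that host-side configuration could look (the block size and the kernel's argument list are illustrative assumptions, and it takes for granted, as the book's example does, that n is a multiple of 4 * block.x):

dim3 block(512);
dim3 grid((n + block.x - 1) / block.x);      // grid for the kernel without unrolling
// with 4-way unrolling each block covers 4 * block.x elements,
// so only a quarter of the blocks are launched
reduceSmemUnroll<<<grid.x / 4, block>>>(g_idata, g_odata, n);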
Then I have a question about this code snippet from Mark Harris's presentation on parallel reduction, slide 32:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
My question is: how should the grid size be determined when launching this kernel? Should it be grid.x / 2 compared with the configuration without the multiple loads?
Yes, it should be half the number of blocks; Mark's presentation says so on the slide where the code you quoted first appears - already on slide 18:
Halve the number of blocks, and replace single load:
[code snippet]
with two loads and [the] first add of the reduction
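In other words, the change on that slide boils down to the following (a paraphrased sketch, written with blockSize as in the snippet above rather than the slide's blockDim.x):

// single load per thread: one block per blockSize elements
unsigned int i = blockIdx.x*blockSize + threadIdx.x;
sdata[tid] = g_idata[i];

// two loads plus the first add: half the blocks, each covering 2*blockSize elements
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockSize];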
Of course, you need to be careful about the sizes. The presentation assumes, for simplicity, that your overall length is a power of 2, so you can always safely divide by 2 while there are multiple elements left. In real life that is not the case, so you may need to allow for slack (e.g. "half the grid size plus one if it was odd").
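A minimal sketch of that host-side sizing for the two-loads variant above (the kernel name, block size, and argument list are assumptions; also note that for an n that is not a multiple of 2*blockSize the kernel itself would need a bounds check on the g_idata[i+blockSize] load):

unsigned int blockSize = 256;
unsigned int elemsPerBlock = 2 * blockSize;                        // each block loads two tiles per grid stride
unsigned int numBlocks = (n + elemsPerBlock - 1) / elemsPerBlock;  // round up - this plays the role of "plus one if it was odd"
reduce<<<numBlocks, blockSize, blockSize * sizeof(int)>>>(g_idata, g_odata, n);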