Search code examples
cudareduction

How to understand the stride size for parallel reduction when doing multiple add?


I am learning Mark Harris's implementation of Optimizing Parallel Reduction.

I am confused by these several lines on page 32:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n){
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;}
__syncthreads();

Why here the grid stride is gridSize = blockSize*2*gridDim.x;? From my understanding, we should compute the grid one by one? So it should be gridSize = blockSize*gridDim.x;?


Solution

  • The answer here is very similar to the answer to your previous question. Each thread in the block (or grid) handles two elements from the data set:

    sdata[tid] += g_idata[i] + g_idata[i+blockSize]
                  element 1     element 2
    

    for every stride that the grid takes. Therefore if we measure the grid stride in terms of the number of elements processed per stride, that number is twice the grid dimension in threads.

    grid size in threads:

    gridSize = blockSize*gridDim.x;
    

    grid size in number of elements processed per stride:

    gridSize = 2*blockSize*gridDim.x;