Search code examples
cudaparallel-processinggpureduction

Reduction of odd number of elements CUDA


It seems that it possible to make reduction only for odd number of elements. For example, it needs to sum up numbers. When I have even number of elements, it will be like this:

1 2 3 4
1+2
3+3
6+4

But what to do when I have, for instance 1 2 3 4 5? The last iteration is the sum of three elements 6+4+5 or what? I saw the same question here, but couldn't find the answer.


Solution

  • A parallel reduction will add pairs of elements first:

    1  1+3   4+6
    2  2+4
    3
    4
    

    Your example with an odd number of elements would typically be realized as:

    1  1+4  5+3  8+7
    2  2+5  7+0
    3  3+0
    4  0+0
    5
    0
    0
    0
    

    That is to say, typically a parallel reduction will work with a power-of-2 set of threads, and at most one threadblock (the last one) will have less than a full complement of data to work with. The usual method to handle this is to zero-pad the data out to the threadblock size. If you study the cuda parallel reduction sample code, you'll find examples of this.