I am trying to perform a reduction on the GPU, that is, to find the maximum value over all the elements of an array. There is a tutorial from NVIDIA here; let's say slide 7 for the simplest method.
The only problem is that my array is huge: it can reach 4 billion elements. In the sample code on slide 7, the data is copied back and forth between per-block shared memory and global memory, and as far as I understand, using global memory to store all the elements cannot be avoided. That storage exceeds the 2 GB of memory on my graphics card.
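For reference, my understanding of the slide-7 kernel, adapted from sum to max (this is my own adaptation, not the exact slide code), is roughly:

```
#include <cuda_runtime.h>
#include <cfloat>

// Slide-7 style reduction, with fmaxf in place of the sum. Each block
// reduces blockDim.x elements of g_in (which must live in global memory)
// to a single partial maximum in g_out; the partials then need a further
// reduction pass. Assumes blockDim.x is a power of two and the kernel is
// launched with blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void reduceMax(const float *g_in, float *g_out, size_t n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    size_t i = (size_t)blockIdx.x * blockDim.x + tid;

    // Each thread loads one element from global into shared memory;
    // out-of-range threads load the identity for max.
    sdata[tid] = (i < n) ? g_in[i] : -FLT_MAX;
    __syncthreads();

    // Tree reduction in shared memory (interleaved addressing, as on the slide).
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // Thread 0 writes the block's partial result back to global memory.
    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}
```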
Is there any way to do this reduction on such huge arrays, or is this a limit of current graphics hardware?
PS: In a future extended version, I also plan to work with far more than 4 billion elements.
Reduction is an operation which you can do in chunks.
The simplest solution would be to allocate two data buffers and two result buffers on the GPU, and then overlap transfers to the GPU with execution of the reduction kernel. Your CPU can reduce the output of each successive GPU reduction while the GPU is busy with the next chunk. That way you amortise most of the cost of the data transfers and of processing the partial reduction results.
You can do all of this with the standard reduction kernel that NVIDIA supplies with the CUDA examples; a rough sketch of the double-buffered scheme follows.
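Here is a minimal sketch of that scheme, assuming a simple block-max kernel with a grid-stride loop rather than NVIDIA's exact sample kernel. The names (blockMax, hugeMax) and the chunk, block and grid sizes are my own choices, and error checking on the CUDA calls is omitted for brevity:

```
#include <cuda_runtime.h>
#include <cfloat>
#include <cmath>
#include <algorithm>

const size_t CHUNK = size_t(1) << 24;  // 16M floats (64 MB) per chunk -- tune to taste
const int    BLOCK = 256;              // threads per block (power of two)
const int    GRID  = 256;              // fixed grid; the kernel grid-strides

// Each block computes the max of a grid-strided slice of in[] and writes
// one partial result to out[blockIdx.x].
__global__ void blockMax(const float *in, float *out, size_t n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    float m = -FLT_MAX;                // identity for max over finite floats

    // Grid-stride loop: a fixed-size grid covers a chunk of any length.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + tid; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        m = fmaxf(m, in[i]);

    sdata[tid] = m;
    __syncthreads();

    // Standard shared-memory tree reduction, with max instead of sum.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Reduce n floats in host memory. h_data must be pinned (cudaMallocHost or
// cudaHostRegister) for the async copies to really overlap with execution.
float hugeMax(const float *h_data, size_t n)
{
    float *d_in[2], *d_out[2], *h_part[2];
    cudaStream_t str[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_in[b], CHUNK * sizeof(float));
        cudaMalloc(&d_out[b], GRID * sizeof(float));
        cudaMallocHost(&h_part[b], GRID * sizeof(float));
        cudaStreamCreate(&str[b]);
    }

    float result = -FLT_MAX;
    size_t nChunks = (n + CHUNK - 1) / CHUNK;

    // Ping-pong between the two buffer pairs; the two extra iterations
    // at the end drain the pipeline.
    for (size_t c = 0; c < nChunks + 2; ++c) {
        int b = c % 2;

        // Wait for chunk c-2, the previous user of buffer pair b, then fold
        // its partial block maxima into the running result on the CPU.
        cudaStreamSynchronize(str[b]);
        if (c >= 2)
            for (int i = 0; i < GRID; ++i)
                result = fmaxf(result, h_part[b][i]);

        // Queue chunk c: copy in, reduce, copy the partials back.
        if (c < nChunks) {
            size_t off = c * CHUNK;
            size_t len = std::min(CHUNK, n - off);
            cudaMemcpyAsync(d_in[b], h_data + off, len * sizeof(float),
                            cudaMemcpyHostToDevice, str[b]);
            blockMax<<<GRID, BLOCK, BLOCK * sizeof(float), str[b]>>>
                    (d_in[b], d_out[b], len);
            cudaMemcpyAsync(h_part[b], d_out[b], GRID * sizeof(float),
                            cudaMemcpyDeviceToHost, str[b]);
        }
    }

    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(str[b]);
        cudaFree(d_in[b]);
        cudaFree(d_out[b]);
        cudaFreeHost(h_part[b]);
    }
    return result;
}
```

With this layout only 2 * CHUNK floats ever live in GPU memory at once, so the array size is limited by host storage rather than by the 2 GB on the card. The overlap also requires pinned host memory (pageable memory serialises the copies) and a GPU with a copy engine, which any reasonably recent card has.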