Why Nvidia and AMD OpenCL reduction example did not reduce an array to an element in one go?

I am working on some OpenCL reduction and I found AMD and Nvidia both has some example like the following kernel (this one is taken from Nvidia's website, but AMD has a similar one):

__kernel void reduce2(__global T *g_idata, __global T *g_odata, unsigned int n, __local T* sdata){
// load shared mem
unsigned int tid = get_local_id(0);
unsigned int i = get_global_id(0);

sdata[tid] = (i < n) ? g_idata[i] : 0;

barrier(CLK_LOCAL_MEM_FENCE);

// do reduction in shared mem
for(unsigned int s=get_local_size(0)/2; s>0; s>>=1)
{
    if (tid < s)
    {
        sdata[tid] += sdata[tid + s];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}

// write result for this block to global mem
if (tid == 0) g_odata[get_group_id(0)] = sdata[0];}

I have two questions:

the code above reduce an array to another smaller array, I am just wondering why all the example I saw did the same instead of reducing an array directly into a single element, which is the usual semantic of "reduction" (IMHO). This should be easily achievable with an outer loop inside the kernel. Is there special reason for this?
I have implemented this reduction and found it quite slow, is there any optimisation I can do to improve it? I saw another example used some unrolling to avoid synchronisation in the loop, but I did not quite get the idea, can you explain a bit?

Solution

The reduction problem in a multithread environment is a very special parallel problem. There is one path that needs to be done sequentially, which is the element 0 to the power of 2. Even if you had infinite threads for processing, you will need log2(N) passes trough the array to reduce it to a single element.

In a real system your number of threads (work-items) are reduced but high (~128-2048). So, in order to use them efficiently all of them have to have something to do. But since the problem is more and more serial and less parallel as you reduce the size of the reduction. These algorithms only bother about the high part, and let the CPU do the rest of the reduction.

To make the story short. You can reduce an array from 1024 to 512 in one pass, but you need the same power to reduce it from 2 to 1. In the latter case all the threads minus 1 are idle, an incredible waste of GPU resources (99.7% idle).

As you can see, there is no point in reducing this last part on a GPU. It is easier to simply copy it to CPU and do it sequentially.

Answering your question: Yes, it is slow, and will always be. If there was a magic trick to solve it, then AMD and nVIDIA would be using it don't you think? :)