Tags: c++, cuda, reduction

CUDA reduction, approach for big arrays


I have the following "Frankenstein" sum reduction code, taken partly from the well-known CUDA reduction slides and partly from the CUDA samples.

    // blockSize is assumed to be a compile-time constant equal to blockDim.x (512 here)
    #define blockSize 512

    __global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
    {
        extern __shared__ float sdata[];

        // perform first level of reduction,
        // reading from global memory, writing to shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
        unsigned int gridSize = blockSize*2*gridDim.x;
        sdata[tid] = 0;
        float mySum = 0;

        // grid-stride loop: each thread accumulates two elements per iteration
        while (i < n) {
            sdata[tid] += g_idata[i];
            if (i + blockSize < n)
                sdata[tid] += g_idata[i + blockSize];
            i += gridSize;
        }
        __syncthreads();

        // do reduction in shared mem
        if (tid < 256)
            sdata[tid] += sdata[tid + 256];
        __syncthreads();

        if (tid < 128)
            sdata[tid] += sdata[tid + 128];
        __syncthreads();

        if (tid <  64)
            sdata[tid] += sdata[tid +  64];
        __syncthreads();

    #if (__CUDA_ARCH__ >= 300)
        if (tid < 32)
        {
            // Fetch final intermediate sum from 2nd warp
            mySum = sdata[tid] + sdata[tid + 32];
            // Reduce final warp using shuffle
            for (int offset = warpSize/2; offset > 0; offset /= 2)
                mySum += __shfl_down(mySum, offset);
        }
        if (tid == 0) sdata[0] = mySum;   // only lane 0 holds the complete warp sum
    #else
        // fully unroll reduction within a single warp
        // (volatile keeps the compiler from caching shared-memory values in registers)
        if (tid < 32) {
            volatile float *smem = sdata;
            smem[tid] += smem[tid + 32];
            smem[tid] += smem[tid + 16];
            smem[tid] += smem[tid +  8];
            smem[tid] += smem[tid +  4];
            smem[tid] += smem[tid +  2];
            smem[tid] += smem[tid +  1];
        }
    #endif
        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

I will be using this to reduce a flattened 3D array of large size (e.g. n = 512^3 = 134217728) on a Tesla K40 GPU.

I have some questions regarding the blockSize variable and its value.

From here on, I will try to explain my understanding (either right or wrong) of how it works:

The bigger I choose blockSize, the faster this code will execute, as it will spend less time in the whole loop; but it will not finish reducing the whole array, it will instead return a smaller array of size dimBlock.x, right? If I used blockSize=1, this code would return the reduced value in one call, but it would be really slow because it would barely exploit the power of CUDA at all. Therefore I need to call the reduction kernel several times, each time with a smaller blockSize, reducing the result of the previous call, until I get down to a single value.

Something like this (pseudocode):

    blocks = number; // where do we start? why?
    while (not the min) {
        dim3 dimBlock(blocks);
        dim3 dimGrid(n / dimBlock.x);
        int smemSize = dimBlock.x * sizeof(float);
        reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);

        in = out;
        n = dimGrid.x;
        dimGrid.x = n / dimBlock.x; // is this right? Should I also change dimBlock?
    }

At which value should I start? I guess this is GPU-dependent. Which value should it be for a Tesla K40 (just so that I understand how these values are chosen)?

Is my logic somehow flawed? How?


Solution

  • There is a CUDA tool to get good grid and block sizes for you: the CUDA Occupancy API.

    In response to "The bigger I choose blockSize, the faster this code will execute" -- Not necessarily, as you want the sizes which give max occupancy (the ratio of active warps to the total number of possible active warps).

    See this answer for additional information: How do I choose grid and block dimensions for CUDA kernels?
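
    As a rough illustration (not from the original post), the Occupancy API can be pointed at reduce6 directly. The sketch below assumes CUDA 6.5 or later and that it sits in the same .cu file as the kernel above; it uses cudaOccupancyMaxPotentialBlockSizeVariableSMem because this kernel's dynamic shared memory grows with the block size. Note that reduce6 as written hard-codes blockSize at compile time, so the suggestion is only directly usable if the kernel is adjusted (e.g. templated) to match.

        #include <cuda_runtime.h>
        #include <cstdio>

        // Dynamic shared memory needed per block: one float per thread.
        struct SmemPerBlock {
            size_t operator()(int threads) const { return threads * sizeof(float); }
        };

        int main()
        {
            int minGridSize    = 0;  // minimum grid size to reach full occupancy
            int suggestedBlock = 0;  // suggested threads per block

            // reduce6 is the kernel from the question; error checking omitted for brevity
            cudaOccupancyMaxPotentialBlockSizeVariableSMem(
                &minGridSize, &suggestedBlock, reduce6, SmemPerBlock());

            printf("suggested block size: %d, minimum grid size: %d\n",
                   suggestedBlock, minGridSize);
            return 0;
        }

    The returned pair is a starting point, not a guarantee; the actual grid size for the first pass would still be derived from n as in the question's loop.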

    Lastly, for NVIDIA GPUs of the Kepler generation or later, there are shuffle intrinsics that make reductions easier and faster. Here is an article on how to use the shuffle intrinsics: Faster Parallel Reductions on Kepler.
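
    For reference, a warp-level sum in the spirit of that article can be written as a small helper like the one below (a sketch, not taken from the post; warpReduceSum is just an illustrative name, and __shfl_down is the pre-CUDA-9 form matching the question's code, while newer toolkits spell it __shfl_down_sync with a lane mask):

        // After this loop, lane 0 of the warp holds the sum of all 32 lanes' values.
        __inline__ __device__ float warpReduceSum(float val)
        {
            for (int offset = warpSize / 2; offset > 0; offset /= 2)
                val += __shfl_down(val, offset);
            return val;
        }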

    Update on choosing the number of threads:

    You might not want to use the maximum number of threads if it results in a less efficient use of the registers. From the link on occupancy:

    For purposes of calculating occupancy, the number of registers used by each thread is one of the key factors. For example, devices with compute capability 1.1 have 8,192 32-bit registers per multiprocessor and can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that in one of these devices, for a multiprocessor to have 100% occupancy, each thread can use at most 10 registers. However, this approach of determining how register count affects occupancy does not take into account the register allocation granularity. For example, on a device of compute capability 1.1, a kernel with 128-thread blocks using 12 registers per thread results in an occupancy of 83% with 5 active 128-thread blocks per multi-processor, whereas a kernel with 256-thread blocks using the same 12 registers per thread results in an occupancy of 66% because only two 256-thread blocks can reside on a multiprocessor.

    So the way I understand it is that an increased number of threads has the potential to limit performance because of the way the registers can be allocated. However, this is not always the case, and you need to do the calculation (as in the above statement) yourself to determine the optimal number of threads per block.
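
    If you would rather have the runtime do this bookkeeping than work through the arithmetic by hand, one option (again a sketch, not part of the original answer; reportOccupancy is just an illustrative helper around the kernel from the question) is to query the kernel's register usage with cudaFuncGetAttributes and ask how many blocks of a given size fit on one multiprocessor:

        #include <cuda_runtime.h>
        #include <cstdio>

        // Assumes reduce6 (the kernel from the question) is defined in the same file.
        void reportOccupancy(int threadsPerBlock)
        {
            // Registers and shared memory the compiler actually assigned to reduce6
            cudaFuncAttributes attr;
            cudaFuncGetAttributes(&attr, reduce6);

            // Resident blocks of this size per SM, accounting for registers,
            // shared memory (one float per thread here) and hardware limits
            int blocksPerSM = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(
                &blocksPerSM, reduce6, threadsPerBlock,
                threadsPerBlock * sizeof(float));

            printf("%d regs/thread -> %d blocks of %d threads resident per SM\n",
                   attr.numRegs, blocksPerSM, threadsPerBlock);
        }

    Dividing blocksPerSM * threadsPerBlock by the device's maxThreadsPerMultiProcessor (from cudaGetDeviceProperties) then gives the occupancy fraction that the quoted passage computes by hand.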