Cuda Kernel with reduction - logic errors for dot product of 2 matrices

I am just starting off with CUDA and am trying to wrap my brain around CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices. But I am getting the right answer for only matrices with size 2. For any other size matrices, I am getting it wrong.

This is only the test so I am keeping matrix size very small. Only about 100 so only 1 block would fit it all. Any help would be greatly appreciated. Thanks!

Here is the regular code

float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b

float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];

Here is my cuda kernel

__global__ void sum_reduce(float* da, float* db, float* dc, int n)
 {
     int tid = threadIdx.x;
     dc[tid] = 0;
     for (int stride = 1; stride < n; stride *= 2) {
         if (tid % (2 * stride) == 0)
                 dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
         __syncthreads();
     }
 }

My complete code : http://pastebin.com/zS85URX5

Solution

Hopefully you can figure out why it works for the n=2 case, so let's skip that, and take a look at why it fails for some other case, let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.

In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.

thread 0:   dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2:   dc[2] += da[2]*db[2] + da[3]*db[3];

So far so good. In the second iteration of your for loop, stride is 2, so the thread that passes the if test is thread 0 (only).

thread 0:   dc[0] += da[0]*db[0] + da[2]*db[2];

But this doesn't make sense and is not what we want at all. What we want is something like:

dc[0] += dc[2];

So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (ie. n < max thread block size, using a single block):

 // CUDA kernel code
 __global__ void sum_reduce(float* da, float* db, float* dc, int n)
 {
     int tid = threadIdx.x;
     // do multiplication in parallel for full width of threads
     dc[tid] = da[tid] * db[tid];
     // wait for all threads to complete multiply step
     __syncthreads();
     int stride = blockDim.x;
     while (stride > 1){
       // handle odd step
       if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
       // successively divide problem by 2
       stride >>= 1;
       // add each upper half element to each lower half element
       if (tid < stride) dc[tid] += dc[tid + stride];
       // wait for all threads to complete add step
       __syncthreads();
       }
 }

Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.