I am just starting off with CUDA and am trying to wrap my brain around CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices. But I am getting the right answer for only matrices with size 2. For any other size matrices, I am getting it wrong.
This is only the test so I am keeping matrix size very small. Only about 100 so only 1 block would fit it all. Any help would be greatly appreciated. Thanks!
Here is the regular code
float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b
float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
hx += ha[i] * hb[i];
Here is my cuda kernel
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
int tid = threadIdx.x;
dc[tid] = 0;
for (int stride = 1; stride < n; stride *= 2) {
if (tid % (2 * stride) == 0)
dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
__syncthreads();
}
}
My complete code : http://pastebin.com/zS85URX5
Hopefully you can figure out why it works for the n=2 case, so let's skip that, and take a look at why it fails for some other case, let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.
In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.
thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];
So far so good. In the second iteration of your for loop, stride is 2, so the thread that passes the if test is thread 0 (only).
thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];
But this doesn't make sense and is not what we want at all. What we want is something like:
dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (ie. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
int tid = threadIdx.x;
// do multiplication in parallel for full width of threads
dc[tid] = da[tid] * db[tid];
// wait for all threads to complete multiply step
__syncthreads();
int stride = blockDim.x;
while (stride > 1){
// handle odd step
if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
// successively divide problem by 2
stride >>= 1;
// add each upper half element to each lower half element
if (tid < stride) dc[tid] += dc[tid + stride];
// wait for all threads to complete add step
__syncthreads();
}
}
Note that I'm not really using the n
parameter. Since you are launching the kernel with n
threads, the blockDim.x
built-in variable is equal to n
in this case.