Following this link, I am trying to implement an atomic function that computes the sum of an array of doubles, so I wrote my own atom_add
function (for double).
Here is the kernel code I use:
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

void atom_add_double(__global double *val, double delta)
{
    union {
        double f;
        ulong i;
    } old, new;
    do
    {
        old.f = *val;
        new.f = old.f + delta;
    }
    while (atom_cmpxchg((volatile __global ulong *)val, old.i, new.i) != old.i);
}

__kernel void sumGPU ( __global const double *input,
                       __global double *finalSum
                     )
{
    // Index of current workItem
    uint gid = get_global_id(0);
    // Init sum
    *finalSum = 0.0;
    // Compute final sum
    atom_add_double(finalSum, input[gid]);
}
My issue is that the kernel code produces correct results until the input array reaches roughly 100000 elements.
Above this limit, the result is no longer valid. I can check it easily because in my test case I fill the input array with the loop for(i=0;i<sizeArray;i++) input[i]=i+1;
so the sum should equal sizeArray*(sizeArray+1)/2.
Can I define a function like atom_add_double and put it into the kernel code?
*finalSum = 0.0;
is a race condition across all in-flight threads. On my computer it makes the result zero. Delete it and initialize the value from the host side. On a very good GPU, the number of in-flight threads could be as high as 50000, maybe even more, and each of them executes finalSum = 0.0 before any of them starts the atomic function. Once you pass that limit, the 50001st thread (just an illustrative number) re-initializes the sum to zero after earlier threads have already added their values.
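The effect can be imitated on the CPU with a deliberately simplified model. The sketch below is plain C, not OpenCL, and it assumes that work items run in fixed-size batches in which every item first executes the reset and only then performs its add. Real GPU scheduling is not this orderly, and the names simulate_kernel_with_reset and wave_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, sequential model of the kernel (an assumption, not real GPU
 * scheduling): work items run in batches of wave_size. Within a batch, every
 * item first executes "*finalSum = 0.0" and only then adds its element. */
double simulate_kernel_with_reset(const double *input, size_t n, size_t wave_size)
{
    assert(wave_size > 0);
    double final_sum = 0.0;
    for (size_t start = 0; start < n; start += wave_size) {
        size_t end = (n - start < wave_size) ? n : start + wave_size;
        /* Phase 1: each in-flight item resets the accumulator. */
        for (size_t i = start; i < end; ++i)
            final_sum = 0.0;
        /* Phase 2: each in-flight item adds its own element. */
        for (size_t i = start; i < end; ++i)
            final_sum += input[i];
    }
    return final_sum;
}
```

In this model, an input of 1 to 6 with a batch size of 6 gives 21, while a batch size of 4 gives 11, because the second batch wipes out what the first batch accumulated. That matches the observation that small arrays look correct and large ones do not.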
Also note that in my test the array starts from zero (the zeroth element is zero), so the expected sum is not size*(size+1)/2 but
(size-1)*(size)/2
and I get the right result on my computer once finalSum = 0.0 is deleted from the kernel.
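To check both closed forms without a GPU, here is a minimal CPU-side sketch in C11. It mirrors the compare-and-swap loop of atom_add_double using <stdatomic.h> and zeroes the accumulator exactly once, which stands in for the host-side initialization. The helper names atomic_add_double and sum_sequence are hypothetical, and the code runs on a single thread, so it checks the pattern and the arithmetic, not behavior under contention.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CPU analogue of atom_add_double: retry a 64-bit compare-and-swap on the
 * bit pattern of the double until no other writer intervened. */
void atomic_add_double(_Atomic uint64_t *val, double delta)
{
    assert(sizeof(double) == sizeof(uint64_t));
    uint64_t old_bits = atomic_load(val);
    uint64_t new_bits;
    do {
        double old_f, new_f;
        memcpy(&old_f, &old_bits, sizeof old_f);
        new_f = old_f + delta;
        memcpy(&new_bits, &new_f, sizeof new_bits);
        /* On failure, old_bits is refreshed with the current stored value. */
    } while (!atomic_compare_exchange_weak(val, &old_bits, new_bits));
}

/* Sum the sequence input[i] = i + offset for i in [0, n). The accumulator is
 * zeroed once before any add, as the host would do before launching. */
double sum_sequence(size_t n, double offset)
{
    _Atomic uint64_t acc;
    double zero = 0.0, result;
    uint64_t bits;
    memcpy(&bits, &zero, sizeof bits);
    atomic_init(&acc, bits);
    for (size_t i = 0; i < n; ++i)
        atomic_add_double(&acc, (double)i + offset);
    bits = atomic_load(&acc);
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

Calling sum_sequence(n, 1.0) corresponds to filling with input[i]=i+1 and should match n*(n+1)/2, while sum_sequence(n, 0.0) corresponds to filling with input[i]=i and should match (n-1)*n/2. For array sizes like the ones discussed here, every partial sum is an integer below 2^53, so the comparison in double precision is exact.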