OpenCL - Atomic operation with double - works until limit

Following this link, I am trying to implement an atomic function that computes the sum of an array of doubles, so I wrote my own atom_add function (for double).

Here's the kernel code used:

#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

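// Atomically add delta to *val by emulating a 64-bit floating-point add
// with a compare-and-swap loop on the value's bit pattern.
void atom_add_double(__global double *val, double delta)
{
  union {
    double f;
    ulong  i;
  } old, new;

  do
  {
    old.f = *val;           // snapshot the current value
    new.f = old.f + delta;  // value we want to store
  }
  // Retry if another work-item changed *val since the snapshot.
  while (atom_cmpxchg((volatile __global ulong *)val, old.i, new.i) != old.i);
}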

__kernel void sumGPU(__global const double *input,
                     __global double *finalSum)
{
  // Index of current workItem
  uint gid = get_global_id(0);

  // Init sum
  *finalSum = 0.0;

  // Compute final sum 
  atom_add_double(finalSum, input[gid]);

}

My issue is that the kernel code produces correct results until the input array reaches roughly 100000 elements.

Beyond this limit, the result is no longer correct (I can check this easily because in my test case I fill the input array with the loop for(i=0;i<sizeArray;i++) input[i]=i+1;, so the sum should equal sizeArray*(sizeArray+1)/2).

Can I define and put a function like atom_add_double into the kernel code?

Solution

*finalSum = 0.0;

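is a race condition among all in-flight threads. On my computer it makes the result zero. Delete it from the kernel and initialize finalSum from the host side instead (for example, by writing 0.0 into the buffer with clEnqueueWriteBuffer before launching the kernel). On a very good GPU the number of in-flight threads can be as high as 50000, maybe even more, and every one of them may execute finalSum = 0.0 before any of them reaches the atomic function, which is why small arrays appear to work. Once you pass that limit, a later thread (say the 50001st, an arbitrary number) resets the sum to zero after earlier threads have already added their elements.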
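To illustrate why the bug only shows up past a certain size, here is a minimal sketch in plain C (not OpenCL, and not how a real GPU actually schedules work). It assumes a deliberately simplified model: work-items run in waves of `wave` items, and within a wave every item executes the reset before any item performs its add. The names simulate_sum, wave and reset_in_kernel are made up for this illustration.

```c
#include <assert.h>

/* Simplified sequential model, NOT real GPU scheduling (assumption):
 * work-items run in waves of `wave` items; inside a wave every item
 * executes the reset first, then every item performs its atomic add. */
double simulate_sum(const double *input, int n, int wave, int reset_in_kernel)
{
    assert(wave > 0);                 /* a wave must contain at least one item */

    double final_sum = 0.0;           /* host-side initialization */
    for (int start = 0; start < n; start += wave) {
        int end = (start + wave < n) ? start + wave : n;

        if (reset_in_kernel) {
            final_sum = 0.0;          /* models "*finalSum = 0.0;" in the kernel */
        }
        for (int gid = start; gid < end; ++gid) {
            /* models atom_add_double(finalSum, input[gid]) */
            final_sum += input[gid];
        }
    }
    return final_sum;
}
```

With input[i]=i+1 as in the question, simulate_sum(input, 4, 4, 1) returns 10 (correct, because everything fits in one wave), simulate_sum(input, 8, 4, 1) returns 26 (only 5+6+7+8 survives the second reset), and simulate_sum(input, 8, 4, 0) returns the full 36.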

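Also check which formula you compare against. If the array is filled starting from zero (input[i]=i, so the zeroth element is zero), the sum of all elements is not size*(size+1)/2 but

(size-1)*(size)/2

With input[i]=i+1, as in the question, size*(size+1)/2 is the right reference. Once I delete finalSum = 0.0 from the kernel, I get the right result on my computer.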
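To pick the right reference value on the host, the two closed forms can be checked against a brute-force loop. This is only a sketch in plain C; the function names are hypothetical and the `first` parameter (the value stored in input[0]) is an assumption made for illustration. For the sizes discussed here every partial sum is an integer below 2^53, so the double additions are exact and the order in which the GPU adds them does not matter.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of input[i] = i + 1 for i in [0, n): 1 + 2 + ... + n. */
uint64_t expected_sum_from_one(uint64_t n)
{
    assert(n < (1ULL << 32));         /* keeps n * (n + 1) inside 64 bits */
    return n * (n + 1) / 2;
}

/* Sum of input[i] = i for i in [0, n): 0 + 1 + ... + (n - 1). */
uint64_t expected_sum_from_zero(uint64_t n)
{
    assert(n < (1ULL << 32));
    return n == 0 ? 0 : (n - 1) * n / 2;
}

/* Brute-force reference in double, the type the kernel accumulates in.
 * `first` is the value of input[0]: 1 for the question's fill, 0 otherwise. */
double brute_force_sum(uint64_t n, uint64_t first)
{
    double s = 0.0;
    for (uint64_t i = 0; i < n; ++i) {
        s += (double)(first + i);
    }
    return s;
}
```

For 100000 elements this gives 5000050000 when the fill starts at 1 and 4999950000 when it starts at 0, so a difference of exactly `size` between the GPU result and the reference points at the wrong formula rather than at the atomic function.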