
OpenCL - using atomic reduction for double


I know atomic functions with OpenCL 1.x are not recommended, but I just want to understand an atomic example.

The following kernel code is not working well; it produces random final values for the sum of all array values (sum reduction):

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

void atom_add_double(volatile __local double *val, double delta)
{
  union {
  double f;
  ulong  i;
  } old, new;

  do
  {
   old.f = *val;
   new.f = old.f + delta;
  } 
  while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);

}  

__kernel void sumGPU ( __global const double *input, 
               __local double *localInput,
               __global double *finalSum
                 )
{

  uint lid = get_local_id(0); 
  uint gid = get_global_id(0);
  uint localSize = get_local_size(0);
  uint groupid = get_group_id(0);
  local double partialSum;
  local double finalSumTemp;

 // Initialize sums
  if (lid==0)
  {
   partialSum = 0.0;
   finalSumTemp = 0.0;
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  // Set in local memory
  int idx = groupid * localSize + lid;
  localInput[lid] = input[idx];

  // Compute atom_add into each workGroup
  barrier(CLK_LOCAL_MEM_FENCE);
  atom_add_double(&partialSum, localInput[lid]);
  // See and Check if barrier below is necessary
  barrier(CLK_LOCAL_MEM_FENCE);

  // Final sum of partialSums
  if (lid==0)
  {
   atom_add_double(&finalSumTemp, partialSum);
   *finalSum = finalSumTemp;
  }

}                   

The version with the global-id strategy works well, but the version above, which goes through local memory (shared memory), doesn't give the expected results (the value of *finalSum is random on each execution).
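
For reference, the version with the global-id strategy is roughly of this form (a simplified sketch, not my exact code; the *_global names are just illustrative): every work-item atomically adds its own element straight into the global result.

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

// Same CAS-based addition as above, but on a __global address
void atom_add_double_global(volatile __global double *val, double delta)
{
  union {
  double f;
  ulong  i;
  } old, new;

  do
  {
   old.f = *val;
   new.f = old.f + delta;
  }
  while (atom_cmpxchg((volatile __global ulong *)val, old.i, new.i) != old.i);
}

__kernel void sumGPU_globalAtomics ( __global const double *input,
                                     __global double *finalSum )
{
  // Each work-item adds its element directly into *finalSum
  atom_add_double_global(finalSum, input[get_global_id(0)]);
}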

Here are the buffers and kernel args that I have set in my host code:

 // Write to buffers
  ret = clEnqueueWriteBuffer(command_queue, inputBuffer, CL_TRUE, 0,
        nWorkItems * sizeof(double), xInput, 0, NULL, NULL);
  ret = clEnqueueWriteBuffer(command_queue, finalSumBuffer, CL_TRUE, 0,
                      sizeof(double), finalSumGPU, 0, NULL, NULL);

 // Set the arguments of the kernel
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
  clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&finalSumBuffer);

Finally, I read finalSumBuffer to get the sum value.
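
The read-back itself is just a blocking clEnqueueReadBuffer (finalSumHost below is only an illustrative variable name):

 // Read the final sum back from the device (blocking read)
  double finalSumHost = 0.0;
  ret = clEnqueueReadBuffer(command_queue, finalSumBuffer, CL_TRUE, 0,
        sizeof(double), &finalSumHost, 0, NULL, NULL);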

I think my issue comes from the kernel code, but I can't find where the error is.

If anyone can see what's wrong, please let me know.

Thanks

UPDATE 1:

I have nearly managed to perform this reduction. Following the suggestions of huseyin tugrul buyukisik, I have modified the kernel code like this:

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

void atom_add_double(volatile __local double *val, double delta)
{
  union {
  double d;
  ulong  i;
  } old, new;

  do
  {
   old.d = *val;
   new.d = old.d + delta;
  } 
  while (atom_cmpxchg((volatile __local ulong *)val, old.i, new.i) != old.i);

}  

__kernel void sumGPU ( __global const double *input, 
               __local double *localInput,
               __local double *partialSum,
               __global double *finalSum
                 )
{

  uint lid = get_local_id(0); 
  uint gid = get_global_id(0);
  uint localSize = get_local_size(0);
  uint groupid = get_group_id(0);

  // Initialize partial sums
  if (lid==0)
    partialSum[groupid] = 0.0; 


  barrier(CLK_LOCAL_MEM_FENCE);
  // Set in local memory
  int idx = groupid * localSize + lid;
  localInput[lid] = input[idx];

  // Compute atom_add into each workGroup
  barrier(CLK_LOCAL_MEM_FENCE);
  atom_add_double(&partialSum[groupid], localInput[lid]);
  // See and Check if barrier below is necessary
  barrier(CLK_LOCAL_MEM_FENCE);

  // Compute final sum
  if (lid==0)
    *finalSum += partialSum[groupid]; 

}                   

As huseyin said, I don't need to use atomic functions for the final sum of all partial sums.

So at the end I did:

// Compute final sum
  if (lid==0)
    *finalSum += partialSum[groupid]; 

But unfortunately, the final sum doesn't give the expected value, and the result is random (for example, with nWorkItems = 1024 and workgroup size = 16, I get random values on the order of 1e+3 to 1e+4 instead of the expected 5.248e+05).
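
(For context, 5.248e+05 = 524800 = 1024 * 1025 / 2, so presumably the input is xInput[i] = i + 1 for i = 0..1023. A simple host-side reference to compare against would be:)

 // Reference sum computed on the host from the same xInput array
  double refSum = 0.0;
  for (int i = 0; i < nWorkItems; i++)
    refSum += xInput[i];   // 524800 if xInput[i] = i + 1 and nWorkItems = 1024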

Here is the setting of the kernel arguments in the host code:

 // Set the arguments of the kernel
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
  clSetKernelArg(kernel, 1, local_item_size*sizeof(double), NULL);
  clSetKernelArg(kernel, 2, nWorkGroups*sizeof(double), NULL);
  clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&finalSumBuffer);

Can you see where my error is in the kernel code?

Thanks


Solution

  • Not an error but a logic issue:

    atom_add_double(&finalSumTemp, partialSum);
    

    is executed only once per group (by the zero-local-indexed thread).

    So you are effectively just doing

    finalSumTemp = partialSum
    

    so atomics are not needed here.


  • There is a race condition on

    *finalSum = finalSumTemp;
    

    between workgroups: the zero-indexed local thread of every group writes to the same address. So this should either be an atomic addition (for learning purposes), or each group could write to a different cell to be added on the host side, such as sum_group1 + sum_group2 + ... = total sum (see the first sketch at the end of this answer).


  • In

    int idx = groupid * localSize + lid;
    localInput[lid] = input[idx];
    

    using groupid is suspicious for multi-device summation, because each device has its own global range and work-group id indexing, so two devices could have the same group id values for two different groups. Some device-related offset should be used when multiple devices are involved, such as:

    idx= get_global_id(0) + deviceOffset[deviceId];
    

  • Also, if an atomic operation is unavoidable and is executed exactly N times, it could be moved to a single thread (such as the 0-indexed thread) and looped N times in a second kernel (probably faster), unless the atomic operation's latency can be hidden by other means. A sketch of such a second kernel is given below.
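
    For illustration, here is a sketch of the "different cells" approach (one global slot per work-group, added up on the host afterwards). It reuses the __local atom_add_double from the question; the kernel and argument names are illustrative, not a drop-in fix:

    __kernel void sumGPU_perGroup ( __global const double *input,
                                    __local  double *localInput,
                                    __global double *partialSums )  // one cell per work-group
    {
      uint lid       = get_local_id(0);
      uint groupid   = get_group_id(0);
      uint localSize = get_local_size(0);
      local double partialSum;

      if (lid == 0)
        partialSum = 0.0;
      barrier(CLK_LOCAL_MEM_FENCE);

      localInput[lid] = input[groupid * localSize + lid];
      barrier(CLK_LOCAL_MEM_FENCE);

      // Per-group accumulation with the CAS-based local atomic add
      atom_add_double(&partialSum, localInput[lid]);
      barrier(CLK_LOCAL_MEM_FENCE);

      if (lid == 0)
        partialSums[groupid] = partialSum;   // no race: each group owns its own cell
    }

    The host then reads nWorkGroups doubles from the partialSums buffer and adds them up (sum_group1 + sum_group2 + ... = total sum).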
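
    And a sketch of the "single thread loops N times in a second kernel" idea: after partialSums has been filled by the first kernel, a second kernel (names again illustrative; nGroups passed from the host) serializes the final addition in one work-item, so no atomics are needed:

    __kernel void finishSum ( __global const double *partialSums,
                              __global double *finalSum,
                              uint nGroups )
    {
      // Only the first work-item runs the loop, so there is no contention
      if (get_global_id(0) == 0)
      {
        double sum = 0.0;
        for (uint i = 0; i < nGroups; i++)
          sum += partialSums[i];
        *finalSum = sum;
      }
    }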