Following a post I found, I am trying to implement a sum reduction of an array with this kernel code:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

__kernel void sumGPU(__global const long *input,
                     __global long *finalSum)
{
    uint local_id = get_local_id(0);
    uint group_size = get_local_size(0);

    // Temporary local value
    local long tempInput;
    tempInput = input[local_id];

    // Variable for final sum
    local long totalSumIntegerPart[1];

    // Initialize sums
    if (local_id == 0)
        totalSumIntegerPart[0] = 0;

    // Compute atom_add into each workGroup
    barrier(CLK_LOCAL_MEM_FENCE);
    atom_add(&totalSumIntegerPart[0], tempInput);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Perform sum of each workGroup sum
    if (local_id == (get_local_size(0) - 1))
        atom_add(finalSum, totalSumIntegerPart[0]);
}
But the value of finalSum is not the expected one. I initially set the input array to:

for (i = 0; i < nWorkItems; i++)
    input[i] = i + 1;

So with nWorkItems = 1024 I expect finalSum = nWorkItems*(nWorkItems+1)/2 = 524800.
But I actually get finalSum = 16384. I get this result with sizeWorkGroup = 16 and nWorkItems = 1024.
Strangely, with sizeWorkGroup = 32 and nWorkItems = 1024, I get another value: finalSum = 32768.
I don't understand the last instruction (which is supposed to accumulate the partial sum of each workgroup):

// Perform sum of each workGroup sum
if (local_id == (get_local_size(0) - 1))
    atom_add(finalSum, totalSumIntegerPart[0]);

Indeed, I would have thought that the instruction atom_add(finalSum, totalSumIntegerPart[0]); should be independent of the local_id condition. The important point is that this instruction has to be executed "number of workGroups" times, i.e. once per workgroup (assuming finalSum is a value shared between all workgroups, isn't it?).
So I thought I could replace:

// Perform sum of each workGroup sum
if (local_id == (get_local_size(0) - 1))
    atom_add(finalSum, totalSumIntegerPart[0]);

by

// Perform sum of each workGroup sum
if (local_id == 0)
    atom_add(finalSum, totalSumIntegerPart[0]);
Could anyone help me get the right value with my parameters (sizeWorkGroup = 16 and nWorkItems = 1024), i.e. a finalSum equal to 524800? Or explain to me why this final sum is not computed correctly?
UPDATE:
Here's the kernel code from the following link (it is slightly different from mine because here atomic_add only adds 1 for each work item):
kernel void AtomicSum(global int* sum)
{
    local int tmpSum[1];
    if (get_local_id(0) == 0) {
        tmpSum[0] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_add(&tmpSum[0], 1);
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == (get_local_size(0) - 1)) {
        atomic_add(sum, tmpSum[0]);
    }
}
Is this a valid kernel code? I mean, does it give correct results?
Maybe a solution could be to put at the beginning of my kernel code:

unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);

// load one tile into local memory
int idx = i * localSize + tid;
localInput[tid] = input[idx];

(with i being the tile index and localInput a local buffer, both still to be declared). I am going to test it and keep you informed.
Thanks
This line is wrong:

tempInput = input[local_id];

It should be:

tempInput = input[get_global_id(0)];

With local_id, every work group reads the same first elements of the input instead of its own slice. Worse, tempInput is a single local variable shared by the whole work group, so all work items overwrite it and then all add the same surviving value (the last one written). That is consistent with your weird results, and explains why they depend on the work group size:

16 (surviving value) * 16 (work items per group) * 64 (groups) = 16384
32 * 32 * 32 = 32768
Also, your code can be simplified a bit:

uint local_id = get_local_id(0);

// Variable for final sum
local long totalSumIntegerPart;

// Initialize sums
if (local_id == 0)
    totalSumIntegerPart = 0;

// Compute atom_add into each workGroup
barrier(CLK_LOCAL_MEM_FENCE);
atom_add(&totalSumIntegerPart, input[get_global_id(0)]);
barrier(CLK_LOCAL_MEM_FENCE);

// Perform sum of each workGroup sum
if (local_id == 0)
    atom_add(finalSum, totalSumIntegerPart);
And I would not abuse atomics the way you do, because they are not the most efficient way of doing reductions. You can probably get 10x more speed with proper reduction methods. However, it is fine as a PoC or for learning local memory and CL.