Following a post I found, I am trying to implement a sum reduction of an array with this kernel code:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

__kernel void sumGPU(__global const long *input,
                     __global long *finalSum)
{
    uint local_id = get_local_id(0);
    uint group_size = get_local_size(0);

    // Temporary local value
    local long tempInput;
    tempInput = input[local_id];

    // Variable for final sum
    local long totalSumIntegerPart[1];

    // Initialize sums
    if (local_id == 0)
        totalSumIntegerPart[0] = 0;

    // Compute atom_add into each workGroup
    barrier(CLK_LOCAL_MEM_FENCE);
    atom_add(&totalSumIntegerPart[0], tempInput);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Perform sum of each workGroup sum
    if (local_id == (get_local_size(0) - 1))
        atom_add(finalSum, totalSumIntegerPart[0]);
}
But the value of finalSum is not the expected one. I initially set the input array to:

for (i = 0; i < nWorkItems; i++)
    input[i] = i + 1;

So with nWorkItems = 1024 I expect finalSum = nWorkItems*(nWorkItems+1)/2 = 524800.
But I actually get finalSum = 16384. I get this result with sizeWorkGroup = 16 and nWorkItems = 1024.
Strangely, with sizeWorkGroup = 32 and nWorkItems = 1024, I get another value: finalSum = 32768.
I don't understand the last instruction (which is supposed to accumulate the partial sum of each workgroup):

// Perform sum of each workGroup sum
if (local_id == (get_local_size(0) - 1))
    atom_add(finalSum, totalSumIntegerPart[0]);

Indeed, I would have thought that the instruction atom_add(finalSum, totalSumIntegerPart[0]); should be independent of the local_id condition. The important point is that this instruction has to be executed "number of workGroups" times, i.e. once per workgroup (assuming finalSum is a value shared between all workgroups, isn't it?).
So I thought I could replace:

// Perform sum of each workGroup sum
if (local_id == (get_local_size(0) - 1))
    atom_add(finalSum, totalSumIntegerPart[0]);

by

// Perform sum of each workGroup sum
if (local_id == 0)
    atom_add(finalSum, totalSumIntegerPart[0]);
Could anyone help me get the right value with my parameters (sizeWorkGroup = 16 and nWorkItems = 1024), i.e. a finalSum equal to 524800? Or explain to me why this final sum is not computed correctly?
UPDATE:
Here's the kernel code from the following link (it is slightly different from mine because here atomic_add only adds 1 for each work item):
kernel void AtomicSum(global int* sum)
{
    local int tmpSum[1];
    if (get_local_id(0) == 0) {
        tmpSum[0] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_add(&tmpSum[0], 1);
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == (get_local_size(0) - 1)) {
        atomic_add(sum, tmpSum[0]);
    }
}
Is this a valid kernel code? I mean, does it give correct results?
Maybe a solution could be to put at the beginning of my kernel code:

unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);

// load one tile into local memory
int idx = i * localSize + tid;
localInput[tid] = input[idx];

(with i being the tile index and localInput a local buffer, both still to be declared). I am going to test it and keep you informed.
Thanks
This line is wrong:

tempInput = input[local_id];

It should be:

tempInput = input[get_global_id(0)];

With local_id, every work group reads the same first elements of the input instead of its own slice. Worse, tempInput is a single local variable shared by the whole work group, so all work items overwrite it and then all add the same surviving value (the last one written). That is consistent with your weird results, and explains why they depend on the work group size:

16 (surviving value) * 16 (work items per group) * 64 (groups) = 16384
32 * 32 * 32 = 32768
Also, your code can be simplified a bit:

uint local_id = get_local_id(0);

// Variable for final sum
local long totalSumIntegerPart;

// Initialize sums
if (local_id == 0)
    totalSumIntegerPart = 0;

// Compute atom_add into each workGroup
barrier(CLK_LOCAL_MEM_FENCE);
atom_add(&totalSumIntegerPart, input[get_global_id(0)]);
barrier(CLK_LOCAL_MEM_FENCE);

// Perform sum of each workGroup sum
if (local_id == 0)
    atom_add(finalSum, totalSumIntegerPart);
And I would not abuse atomics the way you do, because they are not the most efficient way of doing reductions. You can probably get 10x more speed with proper reduction methods. However, it is fine as a PoC or for learning local memory and CL.