I'm trying to get some experience with OpenCL, the environment is setup and I can create and execute kernels. I am currently trying to compute pi in parallel using the Leibniz formula but have been receiving some strange results.
The kernel is as follow:
__kernel void leibniz_cl(__global float *space, __global float *result, int chunk_size)
{
__local float pi[THREADS_PER_WORKGROUP];
pi[get_local_id(0)] = 0.;
for (int i = 0; i < chunk_size; i += THREADS_PER_WORKGROUP) {
// `idx` is the work item's `i` in the grander scheme
int idx = (get_group_id(0) * chunk_size) + get_local_id(0) + i;
float idx_f = 1 / ((2 * (float) idx) + 1);
// Make the fraction negative if needed
if(idx & 1)
idx_f = -idx_f;
pi[get_local_id(0)] += idx_f;
}
// Reduction within workgroups (in `pi[]`)
for(int groupsize = THREADS_PER_WORKGROUP / 2; groupsize > 0; groupsize >>= 1) {
if (get_local_id(0) < groupsize)
pi[get_local_id(0)] += pi[get_local_id(0) + groupsize];
barrier(CLK_LOCAL_MEM_FENCE);
}
If I end the function here and set result
to pi[get_local_id(0)]
for !get_global_id(0)
(as in the reduction for the first group), printing result
prints -nan
.
Remainder of kernel:
// Reduction amongst workgroups (into `space[]`)
if(!get_local_id(0)) {
space[get_group_id(0)] = pi[get_local_id(0)];
for(int groupsize = get_num_groups(0) / 2; groupsize > 0; groupsize >>= 1) {
if(get_group_id(0) < groupsize)
space[get_group_id(0)] += space[get_group_id(0) + groupsize];
barrier(CLK_LOCAL_MEM_FENCE);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(get_global_id(0) == 0)
*result = space[get_group_id(0)] * 4;
}
Returning space[get_group_id(0)] * 4
returns either -nan
or a very large number which clearly is not an approximation of pi.
I can't decide if it is an OpenCL concept I'm missing or a parallel execution one in general. Any help is appreciated.
Links
Reduction template: OpenCL float sum reduction
Leibniz Formula: https://www.wikiwand.com/en/Leibniz_formula_for_%CF%80
Maybe these are not most critical issues with the code but they can be the source of problem:
barrier(CLK_LOCAL_MEM_FENCE);
before local reduction. This can be avoided if only you know that work group size is equal or smaller than number of threads in wavefront running same instruction in parallel - 64 for AMD GPUs, 32 for NVidia GPUs.barrier()
works for work items of same work group only. Clear and 100% working way to insert a barrier into kernel is splittion it in two in the place where global barier is needed.