c parallel-processing opencl pi approximation

OpenCL, C - Leibniz Formula for Pi

I'm trying to get some experience with OpenCL, the environment is setup and I can create and execute kernels. I am currently trying to compute pi in parallel using the Leibniz formula but have been receiving some strange results.

The kernel is as follow:

__kernel void leibniz_cl(__global float *space, __global float *result, int chunk_size) 
{
    __local float pi[THREADS_PER_WORKGROUP];
    pi[get_local_id(0)] = 0.;

    for (int i = 0; i < chunk_size; i += THREADS_PER_WORKGROUP) {
        // `idx` is the work item's `i` in the grander scheme
        int idx = (get_group_id(0) * chunk_size) + get_local_id(0) + i;
        float idx_f = 1 / ((2 * (float) idx) + 1);

        // Make the fraction negative if needed
        if(idx & 1)
            idx_f = -idx_f;

        pi[get_local_id(0)] += idx_f;
    }

    // Reduction within workgroups (in `pi[]`)
    for(int groupsize = THREADS_PER_WORKGROUP / 2; groupsize > 0; groupsize >>= 1) { 
        if (get_local_id(0) < groupsize) 
            pi[get_local_id(0)] += pi[get_local_id(0) + groupsize];

        barrier(CLK_LOCAL_MEM_FENCE);
    }

If I end the function here and set result to pi[get_local_id(0)] for !get_global_id(0) (as in the reduction for the first group), printing result prints -nan.

Remainder of kernel:

    // Reduction amongst workgroups (into `space[]`)
    if(!get_local_id(0)) {
        space[get_group_id(0)] = pi[get_local_id(0)];

        for(int groupsize = get_num_groups(0) / 2; groupsize > 0; groupsize >>= 1) { 
            if(get_group_id(0) < groupsize)
                space[get_group_id(0)] += space[get_group_id(0) + groupsize];

            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    if(get_global_id(0) == 0)
        *result = space[get_group_id(0)] * 4;
}

Returning space[get_group_id(0)] * 4 returns either -nan or a very large number which clearly is not an approximation of pi.

I can't decide if it is an OpenCL concept I'm missing or a parallel execution one in general. Any help is appreciated.

Links

Reduction template: OpenCL float sum reduction

Leibniz Formula: https://www.wikiwand.com/en/Leibniz_formula_for_%CF%80

Solution

Maybe these are not most critical issues with the code but they can be the source of problem:

You definetly should use barrier(CLK_LOCAL_MEM_FENCE); before local reduction. This can be avoided if only you know that work group size is equal or smaller than number of threads in wavefront running same instruction in parallel - 64 for AMD GPUs, 32 for NVidia GPUs.
Global reduction must be done in multiple launches of kernel because barrier() works for work items of same work group only. Clear and 100% working way to insert a barrier into kernel is splittion it in two in the place where global barier is needed.