python-2.7, pyopencl

What's the difference between local and global memory fence in OpenCL?


I was trying to make a reduction sum with PyOpenCL, similar to this example: https://dournac.org/info/gpu_sum_reduction . I'm trying to sum a vector in which every value is 1, so the result should be 16384 in the first element. However, it seems that only some of the values are being gathered. Is a local index necessary? Is there a race condition (when I run it twice the result is not the same)? What's wrong with the following code?

import numpy as np
import pyopencl as cl

def readKernel(kernelFile):
    with open(kernelFile, 'r') as f:
        data=f.read()
    return data

a_np = np.random.rand(128*128).astype(np.float32)
a_np=a_np.reshape((128,128))
print(a_np.shape)

device = cl.get_platforms()[0].get_devices(cl.device_type.GPU)[0]
print(device)
ctx=cl.Context(devices=[device])
#ctx = cl.create_some_context() #ask which context to use 
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)

prg = cl.Program(ctx,readKernel("kernel2.cl")).build()

prg.test(queue, a_np.shape, None, a_g)

cl.enqueue_copy(queue, a_np, a_g).wait()
np.savetxt("teste2.txt",a_np,fmt="%i")

The kernel is:

__kernel void test(__global float *count){
    int id = get_global_id(0)+get_global_id(1)*get_global_size(0);
    int nelements = get_global_size(0)*get_global_size(1);

    count[id] = 1;
    barrier(CLK_GLOBAL_MEM_FENCE); 

    for (int stride = nelements/2; stride>0; stride = stride/2){
        barrier(CLK_GLOBAL_MEM_FENCE); //wait everyone update
        if (id < stride){
            int s1 = count[id];
            int s2 = count[id+stride];
            count[id] = s1+s2;
        }
    }
    barrier(CLK_GLOBAL_MEM_FENCE); //wait everyone update
}

Solution

  • The problem is that your kernel is written to do the reduction within one workgroup, while many workgroups are implicitly scheduled.

    Depending on the GPU, the maximum number of work items per workgroup differs. For Nvidia it is 1024, for AMD and Intel 256 (older Intel GPUs had 512).

    Let's assume for this example that the maximum number of work items per workgroup on your GPU is 256. In this case the maximum 2D workgroup size is 16x16, so if you use that as the size of your matrix the kernel will return the correct result. With the original size of 128x128 and no local size specified when scheduling the kernel, the implementation picks one for you: you get a global size of 128x128 and (very likely) a local size of 16x16, which means 64 workgroups (8 in each dimension) are scheduled. In the current kernel each workgroup starts its calculation from a different id, but the stride loop reduces the indices all the way down to 0, so the workgroups race on the same elements of count and you get different results on each run. Note that barrier(CLK_GLOBAL_MEM_FENCE) only synchronizes the work items of a single workgroup (the flag selects which memory is made consistent, not the scope of the synchronization), so it cannot prevent this race between workgroups.

    You have 2 options to fix this:

    1. Rewrite your kernel to do the whole calculation within one workgroup and schedule it with global and local size (16,16),(16,16), or whatever your device's maximum work items per workgroup allows (see the first sketch below).
    2. Use global and local size (128,128),(16,16); each workgroup then computes its own partial result, and those per-workgroup results have to be summed up on the CPU side to get the final result (see the second sketch below).

    For 128x128 the first option is preferable, as it should perform faster and is more straightforward to implement.
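
    Below is a minimal sketch of option 1, assuming a device whose maximum workgroup size is a power of two (256 is used as the cap). It does the whole reduction inside a single workgroup using local memory and CLK_LOCAL_MEM_FENCE, which only has to synchronize the work items of that one workgroup. For simplicity it uses a 1D launch instead of the (16,16) layout; the kernel name reduce_single_group and the inline kernel string are just illustrative choices, not the only way to write this.

import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void reduce_single_group(__global const float *data,
                                  __global float *result,
                                  const int nelements,
                                  __local float *scratch)
{
    int lid = get_local_id(0);
    int lsize = get_local_size(0);

    // Sequential part: each work item sums its strided share of the input.
    float acc = 0.0f;
    for (int i = lid; i < nelements; i += lsize)
        acc += data[i];
    scratch[lid] = acc;
    barrier(CLK_LOCAL_MEM_FENCE); // make all partial sums visible to the workgroup

    // Tree reduction in local memory, within this single workgroup.
    for (int stride = lsize / 2; stride > 0; stride /= 2){
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        result[0] = scratch[0];
}
"""

a_np = np.ones(128 * 128, dtype=np.float32)  # all values 1, expected sum 16384

device = cl.get_platforms()[0].get_devices(cl.device_type.GPU)[0]
ctx = cl.Context(devices=[device])
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
res_g = cl.Buffer(ctx, mf.WRITE_ONLY, size=a_np.dtype.itemsize)

prg = cl.Program(ctx, kernel_src).build()

# Single workgroup: global size == local size, capped at the device limit.
group_size = min(256, device.max_work_group_size)
prg.reduce_single_group(queue, (group_size,), (group_size,),
                        a_g, res_g, np.int32(a_np.size),
                        cl.LocalMemory(a_np.dtype.itemsize * group_size))

result = np.empty(1, dtype=np.float32)
cl.enqueue_copy(queue, result, res_g).wait()
print(result[0])  # expected: 16384.0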
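
    And a sketch of option 2, again only illustrative: each workgroup of 256 work items reduces its own slice of the vector to one partial sum, and the 64 partial sums are then added on the CPU side. The kernel name partial_sums and the buffer names are assumptions made for this example, and it assumes the input length is a multiple of the workgroup size.

import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void partial_sums(__global const float *data,
                           __global float *partial,
                           __local float *scratch)
{
    int lid = get_local_id(0);

    // Each work item loads one element of its workgroup's slice into local memory.
    scratch[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction within this workgroup only.
    for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2){
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One partial result per workgroup.
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
"""

n = 128 * 128
group_size = 256
num_groups = n // group_size  # 64 partial sums

a_np = np.ones(n, dtype=np.float32)
partial_np = np.empty(num_groups, dtype=np.float32)

device = cl.get_platforms()[0].get_devices(cl.device_type.GPU)[0]
ctx = cl.Context(devices=[device])
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
partial_g = cl.Buffer(ctx, mf.WRITE_ONLY, size=partial_np.nbytes)

prg = cl.Program(ctx, kernel_src).build()
prg.partial_sums(queue, (n,), (group_size,),
                 a_g, partial_g, cl.LocalMemory(a_np.dtype.itemsize * group_size))

cl.enqueue_copy(queue, partial_np, partial_g).wait()
print(partial_np.sum())  # final sum on the CPU side, expected 16384.0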