I was trying to make a sum reduction with PyOpenCL, similar to this example: https://dournac.org/info/gpu_sum_reduction . I'm trying to sum a vector where all values are 1. The result should be 16384 in the first element. However, it seems that only some values are being gathered. Is a local index necessary? Is there a race condition (when I run it twice, the result is not the same)? What's wrong with the following code?
import numpy as np
import pyopencl as cl
def readKernel(kernelFile):
    with open(kernelFile, 'r') as f:
        data = f.read()
    return data
a_np = np.random.rand(128*128).astype(np.float32)
a_np=a_np.reshape((128,128))
print(a_np.shape)
device = cl.get_platforms()[0].get_devices(cl.device_type.GPU)[0]
print(device)
ctx=cl.Context(devices=[device])
#ctx = cl.create_some_context() #ask which context to use
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)
prg = cl.Program(ctx,readKernel("kernel2.cl")).build()
prg.test(queue, a_np.shape, None, a_g)
cl.enqueue_copy(queue, a_np, a_g).wait()
np.savetxt("teste2.txt",a_np,fmt="%i")
The kernel is:
__kernel void test(__global float *count){
    int id = get_global_id(0) + get_global_id(1)*get_global_size(0);
    int nelements = get_global_size(0)*get_global_size(1);
    count[id] = 1;
    barrier(CLK_GLOBAL_MEM_FENCE);
    for (int stride = nelements/2; stride > 0; stride = stride/2){
        barrier(CLK_GLOBAL_MEM_FENCE); // wait for everyone to update
        if (id < stride){
            int s1 = count[id];
            int s2 = count[id+stride];
            count[id] = s1+s2;
        }
    }
    barrier(CLK_GLOBAL_MEM_FENCE); // wait for everyone to update
}
The problem is that your kernel is implemented to do a reduction within one workgroup, but many workgroups are implicitly being scheduled.
The maximum number of work items per workgroup depends on the GPU: for Nvidia it is 1024, for AMD and Intel it is 256 (older Intel GPUs had 512).
Let's assume for this example that the maximum number of work items per workgroup on your GPU is 256. In that case the largest 2D workgroup size is 16x16, so if you used that size for your matrix, your kernel would return the correct result. With the original 128x128 size and no local size specified when scheduling the kernel, the implementation picks one for you: you get a global size of 128x128 and (very likely) a local size of 16x16, which means 64 workgroups are being scheduled.
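If you are not sure what the limit is on your device, you can query it from PyOpenCL. A small sketch, reusing the device and prg variables from your script:

import pyopencl as cl

# Hardware limit on work items per workgroup:
print(device.max_work_group_size)
# Per-dimension limits, e.g. (1024, 1024, 64):
print(device.max_work_item_sizes)
# The limit for this particular kernel (may be lower than the device limit):
print(prg.test.get_work_group_info(
    cl.kernel_work_group_info.WORK_GROUP_SIZE, device))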
In the current kernel each workgroup starts its calculation from a different id, but the stride is reduced all the way down to 0, so the workgroups read and overwrite each other's elements. barrier() only synchronizes work items within a single workgroup, never across workgroups, so you have a race condition and hence different results each run.
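You can observe the race directly by launching the original kernel a few times and comparing the first element. A quick sketch, reusing prg, queue and a_g from the question (res is my name for the output array):

res = np.empty(1, dtype=np.float32)
for run in range(3):
    prg.test(queue, (128, 128), None, a_g)  # count[id] = 1 resets the data
    cl.enqueue_copy(queue, res, a_g).wait()
    print(run, res[0])  # the printed sums typically differ between runs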
You have 2 options to fix this:

1. Rewrite the kernel so that the whole reduction is done by a single workgroup: schedule the kernel with the global size equal to the local size, and let each work item accumulate several elements before the tree reduction.
2. Keep many workgroups, let each workgroup reduce its own part of the data into a partial sum, and then add the partial sums in a second kernel launch (or on the host).
For 128x128 the first option will be preferred, as it should perform faster and be more straightforward to implement; see the sketch below.
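Here is a minimal sketch of the first option, assuming a maximum workgroup size of 256 (query it as shown above). The kernel name reduce, the scratch local buffer and the n argument are my additions, not part of your original code. The kernel is launched with global size equal to local size, so exactly one workgroup runs; each work item first accumulates a strided slice of the input, and the tree reduction then happens in local memory, where barrier() really does synchronize all participants:

import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void reduce(__global float *count, __local float *scratch, int n)
{
    int lid = get_local_id(0);
    int lsize = get_local_size(0);

    // Each work item accumulates a strided slice of the input.
    float acc = 0.0f;
    for (int i = lid; i < n; i += lsize)
        acc += count[i];
    scratch[lid] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory; barrier() is valid here because
    // all participating work items belong to the same workgroup.
    for (int stride = lsize / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        count[0] = scratch[0];
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 128 * 128
a_np = np.ones(n, dtype=np.float32)  # all ones, so the sum should be 16384
a_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=a_np)

prg = cl.Program(ctx, kernel_src).build()
wg = 256  # assumed workgroup size; check device.max_work_group_size first
# Global size == local size, so exactly one workgroup is scheduled.
prg.reduce(queue, (wg,), (wg,), a_g,
           cl.LocalMemory(4 * wg), np.int32(n))

result = np.empty(1, dtype=np.float32)
cl.enqueue_copy(queue, result, a_g).wait()
print(result[0])  # expected: 16384.0

With a single workgroup there is nothing to synchronize across workgroups, so the result is deterministic. For the second option you would instead launch the full 128x128 grid, have work item 0 of each workgroup write its partial sum into a separate partial-sums buffer, and reduce that buffer in a second kernel launch or on the host.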