For a project, I had to dive into OpenCL: things are going fairly well, except that I now need atomic operations. I'm executing the OpenCL code on an Nvidia GPU with the latest drivers. Querying `CL_DEVICE_VERSION` via `clGetDeviceInfo()` returns `OpenCL 1.0 CUDA`, so I assume I have to refer to the OpenCL 1.0 specs.
I started using an `atom_add` operation in my kernel on a `__global int* vnumber` buffer: `atom_add(&vnumber[0], 1);`. This gave me clearly wrong results, so as an additional check I moved the add instruction to the beginning of the kernel, so that it is executed once per thread. When the kernel is launched with 512 x 512 threads, the content of `vnumber[0]` is 524288, which is exactly 2 x 512 x 512, two times the value I should get. The funny thing is that if I change the add operation to `atom_add(&vnumber[0], 2);`, the returned value is 65536, again two times what I should get.
Has anyone already experienced something similar? Am I missing something very basic? I have checked the correctness of the data types and they seem fine (I'm using an `int*` buffer and allocating it with `sizeof(cl_int)`).
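
For reference, the kernel boils down to something like the following sketch (the kernel name and the extension pragma are my additions for illustration; only `vnumber` and the `atom_add` call come from the question):

```c
// OpenCL 1.0: atomics on __global int are not core functionality; they
// require the cl_khr_global_int32_base_atomics extension to be enabled.
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void count_threads(__global int *vnumber)
{
    // Executed once per work-item: with a 512 x 512 NDRange the expected
    // final value of vnumber[0] is 262144, not the observed 524288.
    atom_add(&vnumber[0], 1);
}
```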
You are using `atom_add`, which in OpenCL 1.0 is only available through extensions (`cl_khr_global_int32_base_atomics` for `__global` memory, `cl_khr_local_int32_base_atomics` for `__local` memory), and your pointer is to global memory. Instead, try OpenCL 1.1's `atomic_add`, which supports global memory as part of the core specification.
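
Concretely, on a device reporting OpenCL 1.1 or later the kernel could be rewritten roughly as follows (a sketch; the kernel name is made up):

```c
// OpenCL 1.1: atomic_add on a __global int pointer is part of the core
// specification, so no extension pragma is needed.
__kernel void count_threads(__global int *vnumber)
{
    atomic_add(&vnumber[0], 1);
}
```

Note that the device must actually report an OpenCL C version of at least 1.1 for this built-in to be available; on an OpenCL 1.0 device you would instead need to check `CL_DEVICE_EXTENSIONS` for `cl_khr_global_int32_base_atomics` and keep `atom_add` with the extension pragma enabled.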