For a project, I had to dive into OpenCL: things are going fairly well, except that I now need atomic operations. I'm executing the OpenCL code on an Nvidia GPU with the latest drivers. Querying `CL_DEVICE_VERSION` via `clGetDeviceInfo()` returns `OpenCL 1.0 CUDA`, so I assume I have to refer to the OpenCL 1.0 specs.
I started using an `atom_add` operation in my kernel on a `__global int* vnumber` buffer: `atom_add(&vnumber[0], 1);`. This gave me clearly wrong results, so as an additional check I moved the add instruction to the beginning of the kernel, so that it is executed once per thread. When the kernel is launched with 512 x 512 threads, the content of `vnumber[0]` is 524288, which is exactly 2 x 512 x 512, two times the value I should get. The funny thing is that if I change the add operation to `atom_add(&vnumber[0], 2);`, the returned value is 65536, again two times what I should get.
Has anyone already experienced something similar? Am I missing something very basic? I have checked the correctness of the data types and they seem fine (I'm using an `int*` buffer and allocating it with `sizeof(cl_int)`).
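
For reference, the kernel boils down to something like the following sketch (the kernel name and the extension pragma are my additions for illustration; only `vnumber` and the `atom_add` call come from the question):

```c
// OpenCL 1.0: atomics on __global int are not core functionality; they
// require the cl_khr_global_int32_base_atomics extension to be enabled.
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void count_threads(__global int *vnumber)
{
    // Executed once per work-item: with a 512 x 512 NDRange the expected
    // final value of vnumber[0] is 262144, not the observed 524288.
    atom_add(&vnumber[0], 1);
}
```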
You are using `atom_add`, which in OpenCL 1.0 is only available through extensions (`cl_khr_global_int32_base_atomics` for `__global` memory, `cl_khr_local_int32_base_atomics` for `__local` memory), and your pointer is to global memory. Instead, try OpenCL 1.1's `atomic_add`, which supports global memory as part of the core specification.
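
Concretely, on a device reporting OpenCL 1.1 or later the kernel could be rewritten roughly as follows (a sketch; the kernel name is made up):

```c
// OpenCL 1.1: atomic_add on a __global int pointer is part of the core
// specification, so no extension pragma is needed.
__kernel void count_threads(__global int *vnumber)
{
    atomic_add(&vnumber[0], 1);
}
```

Note that the device must actually report an OpenCL C version of at least 1.1 for this built-in to be available; on an OpenCL 1.0 device you would instead need to check `CL_DEVICE_EXTENSIONS` for `cl_khr_global_int32_base_atomics` and keep `atom_add` with the extension pragma enabled.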