Tags: cuda, atomic, multicore, gpu-shared-memory, ptx

Are load and store operations in shared memory atomic?


I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.

On the one hand, it seems that any load/store is compiled to the PTX instruction ld.weak.shared.cta, which does not enforce atomicity. But on the other hand, the manual says that shared memory accesses are serialized (section 9.2.3.1):

However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized

which hints at load/store atomicity "by default" in shared memory. Would the instructions ld.weak.shared.cta and ld.relaxed.shared.cta therefore have the same effect? Or is the distinction information the compiler needs anyway, e.g. to avoid optimizing away loads and stores?
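
One way to check is to compile a small kernel to PTX and compare the instructions emitted for a plain shared variable with those emitted for a cuda::atomic one. A minimal sketch, assuming CUDA 11+ with libcu++ (<cuda/atomic>) and an sm_70-or-newer target; kernel and variable names are made up:

    // ptx_probe.cu -- compile with: nvcc -std=c++17 -arch=sm_70 -ptx ptx_probe.cu
    // then inspect the ld.shared / st.shared instructions and their qualifiers.
    #include <cuda/atomic>

    __global__ void plain_shared(int *out)
    {
        __shared__ int s;                      // plain shared variable
        if (threadIdx.x == 0) s = 42;          // plain store (expected: weak st.shared)
        __syncthreads();
        out[threadIdx.x] = s;                  // plain load (expected: weak ld.shared)
    }

    __global__ void atomic_shared(int *out)
    {
        __shared__ cuda::atomic<int, cuda::thread_scope_block> s;
        if (threadIdx.x == 0)
            s.store(42, cuda::std::memory_order_relaxed);            // relaxed atomic store
        __syncthreads();
        out[threadIdx.x] = s.load(cuda::std::memory_order_relaxed);  // relaxed atomic load
    }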

More generally, supposing variables are properly aligned, would __shared__ int and __shared__ cuda::atomic<int, cuda::thread_scope_block> provide the same guarantees (when considering only load and store operations)?

Bonus (relevant) question: for a properly aligned primitive data type stored in global memory and accessed by threads of a single block, are __device__ int and __device__ cuda::atomic<int, cuda::thread_scope_block> equivalent in terms of atomicity of load/store operations?
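
For reference, the two global-memory variants from this bonus question would be declared roughly like this (a sketch; variable names are made up):

    #include <cuda/atomic>

    // Plain global variable: loads/stores typically compile to plain (weak)
    // ld.global / st.global instructions.
    __device__ int plain_value;

    // Block-scoped atomic in global memory: load()/store() are atomic with
    // respect to other threads of the same block (that is what
    // cuda::thread_scope_block means).
    __device__ cuda::atomic<int, cuda::thread_scope_block> scoped_value;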

Thanks for any insight.


Solution

  • Serialization does not imply atomicity: thread A writes the first 2 bytes of an integer, then thread B reads the variable, and finally thread A writes the last 2 bytes. All of this happens in sequence (not in parallel), yet the access is still not atomic.

    Further, serialization is not guaranteed in all cases, see:

    Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.

    Conclusion: use atomics.
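
A minimal sketch of that recommendation, assuming libcu++'s cuda::atomic and an sm_70-or-newer target (independent thread scheduling, so the intra-warp spin can make progress); kernel and variable names are made up:

    #include <cuda/atomic>

    __global__ void publish_in_block(int *out)
    {
        // Block-scoped atomic flag in shared memory: its load/store are atomic
        // and can carry acquire/release ordering for threads of this block.
        __shared__ cuda::atomic<int, cuda::thread_scope_block> ready;
        __shared__ int payload;

        if (threadIdx.x == 0)
            ready.store(0, cuda::std::memory_order_relaxed);
        __syncthreads();

        if (threadIdx.x == 0) {
            payload = 42;
            ready.store(1, cuda::std::memory_order_release);      // publish payload
        } else if (threadIdx.x == 1) {
            while (ready.load(cuda::std::memory_order_acquire) == 0)
                ;                                                  // spin until published
            out[0] = payload;                                      // guaranteed to read 42
        }
    }

With a plain __shared__ int flag instead, there would be no atomicity or ordering guarantee, and the compiler would also be free to hoist the flag load out of the spin loop.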