I have been using atomicCAS in a do-while loop to perform various arithmetic operations when needed in my first parallel programs. I see that there are other operations like atomicInc, which would be the same thing as incrementing using atomicCAS in a do-while, correct? Would this be more efficient (in terms of clock cycles), or is there no point in transitioning away from my overuse of atomicCAS?
The only sensible answer to that question is "every scenario where there is a purpose built atomic primitive for performing the same operation".
On NVIDIA GPUs, using atomicCAS as a faux mutex around arithmetic operations only makes sense when you have no other alternative. Even if there is no tangible performance difference today, by using an atomic primitive which translates to a PTX instruction, you are offering your code the possibility of performance gains on future hardware and future toolchains as NVIDIA improves its implementations.
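To make the comparison concrete, here is a minimal sketch that puts the CAS retry loop from the question next to the purpose-built primitives. The kernel names, the launch configuration, and the wrap limit of 999 are all made up for illustration. One detail worth noting: atomicInc is not quite a plain increment. It resets the value to 0 once the stored value reaches the limit you pass in, so atomicAdd(addr, 1) is the closer replacement for a CAS-based increment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Increment via a compare-and-swap retry loop (the pattern from the question).
__device__ unsigned int incrementWithCAS(unsigned int *addr)
{
    unsigned int old = *addr, assumed;
    do {
        assumed = old;
        // Stores assumed + 1 only if *addr still equals assumed;
        // atomicCAS returns the value it actually found at addr.
        old = atomicCAS(addr, assumed, assumed + 1);
    } while (old != assumed);  // another thread got in first: retry
    return old;
}

// Hypothetical kernels: every thread bumps one shared counter once.
__global__ void countWithCAS(unsigned int *counter) { incrementWithCAS(counter); }
__global__ void countWithAdd(unsigned int *counter) { atomicAdd(counter, 1u); }
// atomicInc computes (old >= limit) ? 0 : old + 1, i.e. a wrapping counter.
__global__ void countWithInc(unsigned int *counter, unsigned int limit)
{
    atomicInc(counter, limit);
}

int main()
{
    const int blocks = 64, threads = 256;  // arbitrary launch configuration
    unsigned int *d_counters = nullptr;
    unsigned int h_counters[3] = {0, 0, 0};

    cudaMalloc(&d_counters, sizeof(h_counters));
    cudaMemset(d_counters, 0, sizeof(h_counters));

    countWithCAS<<<blocks, threads>>>(d_counters + 0);
    countWithAdd<<<blocks, threads>>>(d_counters + 1);
    countWithInc<<<blocks, threads>>>(d_counters + 2, 999u);  // wraps after 999

    // cudaMemcpy on the default stream waits for the kernels above to finish.
    cudaMemcpy(h_counters, d_counters, sizeof(h_counters), cudaMemcpyDeviceToHost);

    printf("CAS loop : %u\n", h_counters[0]);
    printf("atomicAdd: %u\n", h_counters[1]);
    printf("atomicInc: %u (wrapping counter)\n", h_counters[2]);

    cudaFree(d_counters);
    return 0;
}
```

The CAS loop and atomicAdd should report the same total, while the atomicInc counter ends up below it because it wraps. The CAS loop remains the right tool for operations that have no dedicated primitive on your target hardware.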