When should I favor a more specific atomic operation over using atomicCAS?

I have been using atomicCAS in a do-while loop to perform various arithmetic operations when needed in my first parallel programs. I see that there are other operations like atomicInc which would be the same thing as incrementing using atomicCAS in a do-while, correct? Would this be more efficient (in terms of clock cycles), or is there no point in transitioning away from my overuse of atomicCAS?

Solution

The only sensible answer to that question is "every scenario where there is a purpose-built atomic primitive for performing the same operation".

On NVIDIA GPUs, using atomicCAS as a faux mutex around arithmetic operations only makes sense when you have no other alternative. Even if there is no tangible performance difference today, by using an atomic primitive that maps directly onto a dedicated PTX instruction for the whole operation, you are offering your code the possibility of performance gains on future hardware and future toolchains as NVIDIA improves its implementations.
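To make the comparison concrete, here is a minimal sketch of the same counter update written both ways: a hand-rolled atomicCAS retry loop and a single atomicAdd call. The names `addWithCas`, `countWithCas`, `countWithAdd` and `runCounter` are hypothetical and exist only for this illustration, and the launch configuration is arbitrary. Note that atomicInc is not quite a plain increment: it wraps the value back to 0 once it reaches the limit you pass in, so atomicAdd(addr, 1) is the closer replacement for an "add one" CAS loop.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hand-rolled add built on atomicCAS: read, compute, try to swap, retry on conflict.
// (Hypothetical helper, shown only for comparison.)
__device__ unsigned int addWithCas(unsigned int *addr, unsigned int val)
{
    unsigned int old = *addr, assumed;
    do {
        assumed = old;
        // Store assumed + val only if *addr still holds assumed.
        old = atomicCAS(addr, assumed, assumed + val);
    } while (old != assumed);  // another thread updated *addr first, so retry
    return old;
}

// Every thread increments the counter through the CAS loop.
__global__ void countWithCas(unsigned int *counter)
{
    addWithCas(counter, 1u);
}

// Every thread increments the counter with the purpose-built primitive.
__global__ void countWithAdd(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: launches one of the two kernels and returns the final count.
unsigned int runCounter(bool useCas, int blocks, int threads)
{
    unsigned int *d_counter = nullptr;
    unsigned int h_counter = 0;

    cudaError_t err = cudaMalloc(&d_counter, sizeof(unsigned int));
    assert(err == cudaSuccess);
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    if (useCas)
        countWithCas<<<blocks, threads>>>(d_counter);
    else
        countWithAdd<<<blocks, threads>>>(d_counter);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    const int blocks = 256, threads = 256;  // arbitrary sizes for the sketch
    printf("CAS loop:  %u\n", runCounter(true, blocks, threads));
    printf("atomicAdd: %u\n", runCounter(false, blocks, threads));
    return 0;
}
```

Both variants should report the same total (blocks × threads), so the difference is not correctness. The atomicAdd version is shorter, has no retry loop under contention, and leaves the hardware and compiler free to implement the update however they see fit. Keep the CAS loop for operations that have no dedicated primitive on your target architecture.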