Why does atomic operation need exclusive cache access?

In my understanding atomic operation (c++ atomic for example) first locks the cache line and then performs atomic operation. I have two questions: 1. if let's say atomic compare and swap is atomic operation itself in hardware why we need to lock cache line and 2. when cache line is locked how another cpu is waiting for it? does it use spin-lock style waiting?

thanks

Solution

First of all: It depends!

1.) If a system locks a cache line has nothing to do with c++. It is a question how a cache is organized and especially how assembler instructions acts with cache. That is a question for cpu architecture!

2.) How a compiler performs a atomic operation is implementation depended. Which assembler instructions will be generated to perform a atomic operation can vary from compiler to compiler and even on different versions.

3.) As I know, a full lock of a cache line is only the fall back solution if no "more clever" notification/synchronization of other cores accessing the same cache lines can be performed. But there are not only a single cache involved typically. Think of multi level cache architecture. Some caches are only visible to a single core! So there is a need of performing also more memory system operations as locking a line. You also have to move data from different cache levels also if multiple cores are involved!

4.) From the c++ perspective, a atomic operation is not only a single operation. What really will happen depends on memory ordering options for the atomic operation. As atomic operations often used for inter thread synchronization, a lot more things must be done for a single atomic RMW operation! To get an idea what all has to be done you should give https://www.cplusplusconcurrencyinaction.com/ a chance. It goes into the details of memory barriers and memory ordering.

5.) Locking a cache line ( if this really happens ) should not result in spin locks or other things on other cores as the access for the cache line itself took only some clock cycles. Depending on the architecture it simply "holds" the other core for some cycles. It may happen that the "sleeping" core can do in parallel other things in a different pipe. But hey, that is very hardware specific.

As already given as a comment: Take a look on https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/, it gives some hints what can happen with cache coherency and locking.

There is much more than locking under the hood. I believe your question scratches only on the surface!

For practical usage: Don't think about! Compiler vendors and cpu architects have done a very good job. You as a programmer should measure your code performance. From my perspective: No need to think about of what happens if cache lines are locked. You have to write good algorithms and think about good memory organization of your program data and less interrelationships between threads.