In JCIP Section 15.2 B. Goetz mentioned that in modern processors there was a set of instructions like compare-and-swap (CAS) which allowed us to perform non-blocking update operations.
In particular he said
CAS has three operands - a memory location V on which to operate, the expected old value A, and the new value B. CAS atomically updates V to the new value B, but only if the value in V matches the expected old value A.
Now consider a multiprocessor system. If two different processors try to compare-and-swap at the exactly same time on the same memory location, what will happen than?
The processors do not know about each other. How does such a situation manage?
CAS operations in the end rely on specific hardware instructions, so the actual implementation is hardware-specific. If you want to learn about the details, you need to check the specific architecture on which you're running.
On the most popular x86 architecture, you can think about CAS operations as using a lock but on a much more granular level in the hardware.
E.g. one way to achieve a CAS operation is with LOCK CMPXCHG
. LOCK
is an instruction prefix that ensures exclusive access to the corresponding memory location for the duration of the accompanying instruction. The accompanying instruction in this case is CMPXCHG
, which can literally compare and swap the value at a given memory location as a single instruction.
When you execute myAtomicInt.compareAndSet(0, 1)
from a couple of different threads and that results in a LOCK CMPXCHG
from a couple of different processors, one lucky processor will gain exclusivity first and perform its operation successfully. The locked access to this memory location is linearizable, so the remaining processors that will gain exclusivity later will have their CASes fail (unless an interleaving change introduces an ABA problem).
The memory bus locking can still be elided (see 8.1.4 Effects of a LOCK Operation on Internal Processor Caches in Intel® 64 and IA-32 Architectures Software Developer’s Manual
Another hardware approach for executing CAS operations is LL/SC (Load Linked/Store Conditional) which uses somewhat different mechanism with somewhat different characteristics.
Roughly, it splits the CAS in two parts - the initial load and the conditional store, and it discards the store if there was a change on the related memory between the initial load and the store. This can allow you to avoid the ABA problem, but it has different scalability characteristics.
Here's an example usage of LL/SC in ARM using its LDREX
and STREX
instructions.
Some architectures like SPARC don't have dedicated CAS instructions. On them CAS needs to be implemented as a combination of instructions and eventually memory barriers.