c++ x86 memory-barriers memory-model stdatomic

C++ How is release-and-acquire achieved on x86 only using MOV?

This question is a follow-up/clarification to this:

Does the MOV x86 instruction implement a C++11 memory_order_release atomic store?

The answer there states that a plain MOV instruction is sufficient to provide acquire-release semantics on x86, with no need for LOCK, fences, xchg, etc. However, I am struggling to understand how this works.

Intel doc Vol 3A Chapter 8 states:

https://software.intel.com/sites/default/files/managed/7c/f1/253668-sdm-vol-3a.pdf

In a single-processor (core) system....

Reads are not reordered with other reads.

Writes are not reordered with older reads.

Writes to memory are not reordered with other writes, with the following exceptions:

but this is for a single core. The multi-core section does not seem to mention how loads are enforced:

In a multiple-processor system, the following ordering principles apply:

Individual processors use the same ordering principles as in a single-processor system.

Writes by a single processor are observed in the same order by all processors.

Writes from an individual processor are NOT ordered with respect to the writes from other processors.

Memory ordering obeys causality (memory ordering respects transitive visibility).

Any two stores are seen in a consistent order by processors other than those performing the stores

Locked instructions have a total order.

So how can MOV alone facilitate acquire-release?

Solution

but this is for a single core. The multi-core section does not seem to mention how loads are enforced:

The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is "... when loading/storing from cache-coherent shared memory", i.e. multi-processor systems don't introduce new ways for reordering; they just mean the possible observers now include code on other cores instead of just DMA / IO devices.

The model for reordering of access to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. Actually slightly stronger than acq_rel, which is fine.
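To make that concrete, here is a minimal message-passing sketch. The names payload, ready and message_passing_once are made up for illustration. On x86, compilers typically turn both the release store and the acquire load into plain mov instructions, because the hardware already forbids the reorderings that would break this pattern; check your own compiler's output rather than taking that as guaranteed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                 // plain (non-atomic) data
std::atomic<bool> ready{false};  // flag that publishes the payload

// Runs one producer/consumer round and returns what the consumer read.
int message_passing_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // ordinary store
        ready.store(true, std::memory_order_release);  // typically a plain mov on x86
    });
    std::thread consumer([&seen] {
        // Acquire load: typically a plain mov on x86.
        while (!ready.load(std::memory_order_acquire)) {
            // spin until the producer publishes the flag
        }
        seen = payload;  // release/acquire guarantees this reads 42
    });

    producer.join();
    consumer.join();
    assert(seen != -1);  // the consumer always assigns before exiting
    return seen;
}
```

Calling message_passing_once() should always return 42: the store to payload cannot become visible after the store to ready (no StoreStore reordering), and the load of payload cannot happen before the load of ready (no LoadLoad reordering).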

The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and wasn't visible to any of them before that. (The exception is the core doing the store, which can see its own store early via store forwarding.) That's why local barriers alone are sufficient to recover sequential consistency on top of an SC + store-buffer model. (For x86, mo_seq_cst just needs mfence after SC stores, to drain the store buffer before any further loads can execute. mfence and locked instructions (which are also full barriers) don't have to bother other cores; they just make this one wait.)
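The one reordering the store buffer does allow, StoreLoad, is what the classic store-buffer litmus test probes. Below is a rough sketch of it using seq_cst; store_buffer_litmus is a made-up name. With seq_cst, ISO C++ forbids both threads reading 0, and on x86 compilers typically get that by using xchg or mov + mfence for the stores. If you weakened the orderings to release/acquire, both reads returning 0 would be a legal outcome, although this particular harness (a fresh pair of threads per iteration) is unlikely to hit it often.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffer litmus test `iterations` times
// with seq_cst operations and counts how often both threads read 0.
int store_buffer_litmus(int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // store, then ...
            r1 = y.load(std::memory_order_seq_cst);  // ... load a different location
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        // Under seq_cst at least one thread must see the other's store.
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

store_buffer_litmus(1000) should return 0. The point is that the barrier making this work only has to stall the storing core until its own store buffer drains; nothing has to be sent to the other core.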

One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of chapter 8 of Intel's SDM defines some of this background:

These multiprocessing mechanisms have the following characteristics:

To maintain system memory coherency — When two or more processors are attempting simultaneously to access the same address in system memory, some communication mechanism or memory access protocol must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock a memory location.

To maintain cache consistency — When one processor accesses data cached on another processor, it must not receive incorrect data. If it modifies data, all other processors that access that data must receive the modified data.

To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes be observed externally in precisely the same order as programmed.

[...]

The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.

(CPUs use some variant of MESI; Intel in practice uses MESIF, AMD in practice uses MOESI.)

The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't really a strictly formal definition of the memory model. But section 8.2.3.2, "Neither Loads Nor Stores Are Reordered with Like Operations", shows that loads aren't reordered with loads. Another section shows that LoadStore reordering is forbidden. Acq_rel basically means blocking all reordering except StoreLoad, and that's what x86 does. (See https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)

Related:
How are barriers/fences and acquire, release semantics implemented microarchitecturally?
x86 mfence and C++ memory barrier - asking why no barriers are needed for acq_rel, but coming at it from a different angle (wondering about how data ever becomes visible to other cores).
How do memory_order_seq_cst and memory_order_acq_rel differ? (seq_cst requires flushing the store buffer).
C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?
Globally Invisible load instructions - program-order + store buffer isn't exactly the same as acq_rel, especially once you consider a load that only partially overlaps a recent store.
x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors - a formal memory model for x86.

Other ISAs

In general, most weaker hardware memory models also only allow local reordering, so barriers are still only local within a CPU core, just making (some part of) that core wait until some condition is met (e.g. x86 mfence blocks later loads and stores from executing until the store buffer drains). Other ISAs also benefit from light-weight barriers, for efficiency, for things that x86 enforces between every memory operation, e.g. blocking LoadLoad and LoadStore reordering. (https://preshing.com/20120930/weak-vs-strong-memory-models/)

A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before becoming visible to all, allowing IRIW reordering. Note that mo_acq_rel in C++ allows IRIW reordering; only seq_cst forbids it. Most HW memory models are slightly stronger than ISO C++ and make IRIW reordering impossible, so all cores agree on the global order of stores.
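For reference, IRIW (independent reads of independent writes) looks roughly like the sketch below; iriw_disagreements is a made-up name. Two writers store to different variables, and two readers read them in opposite orders. The readers "disagree" if each one sees its first variable set and its second still 0. This version uses seq_cst, for which ISO C++ forbids disagreement on any hardware. With acquire/release the language would allow it, even though x86 hardware would still never produce it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the IRIW litmus test `iterations` times with
// seq_cst operations and counts how often the two readers disagree about
// the order of the two independent stores.
int iriw_disagreements(int iterations) {
    assert(iterations >= 0);
    int disagreements = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

        std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
        std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
        std::thread ra([&] {
            r1 = x.load(std::memory_order_seq_cst);  // reads x first, then y
            r2 = y.load(std::memory_order_seq_cst);
        });
        std::thread rb([&] {
            r3 = y.load(std::memory_order_seq_cst);  // reads y first, then x
            r4 = x.load(std::memory_order_seq_cst);
        });
        w1.join();
        w2.join();
        ra.join();
        rb.join();

        // Reader A saw x before y, while reader B saw y before x.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) {
            ++disagreements;
        }
    }
    return disagreements;
}
```

iriw_disagreements(500) should return 0, since seq_cst requires a single total order that both readers respect.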