This question is a follow-up/clarification to this:
Does the MOV x86 instruction implement a C++11 memory_order_release atomic store?
This states the MOV
assembly instruction is sufficient to perform acquire-release semantics on x86. We do not need LOCK
, fences or xchg
etc. However, I am struggling to understand how this works.
Intel doc Vol 3A Chapter 8 states:
https://software.intel.com/sites/default/files/managed/7c/f1/253668-sdm-vol-3a.pdf
In a single-processor (core) system....
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.
- Writes to memory are not reordered with other writes, with the following exceptions:
but this is for a single core. The multi-core section does not seem to mention how loads are enforced:
In a multiple-processor system, the following ordering principles apply:
- Individual processors use the same ordering principles as in a single-processor system.
- Writes by a single processor are observed in the same order by all processors.
- Writes from an individual processor are NOT ordered with respect to the writes from other processors.
- Memory ordering obeys causality (memory ordering respects transitive visibility).
- Any two stores are seen in a consistent order by processors other than those performing the stores
- Locked instructions have a total order.
So how can MOV
alone can facilitate acquire-release?
but this is for a single core. The multi-core section does not seem to mention how loads are enforced:
The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is ... when loading/storing from cache-coherent shared memory. i.e. multi-processor systems don't introduce new ways for reordering, they just mean the possible observers now include code on other cores instead of just DMA / IO devices.
The model for reordering of access to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. Actually slightly stronger than acq_rel, which is fine.
The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and didn't become visible to any cores before that. (Except to the core doing the store, via store forwarding.) That's why only local barriers are sufficient to recover sequential consistency on top of a SC + store-buffer model. (For x86, just mo_seq_cst
just needs mfence
after SC stores, to drain the store buffer before any further loads can execute.
mfence
and lock
ed instructions (which are also full barriers) don't have to bother other cores, just make this one wait).
One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of chapter 8 of Intel's SDM defines some of this background:
These multiprocessing mechanisms have the following characteristics:
- To maintain system memory coherency — When two or more processors are attempting simultaneously to access the same address in system memory, some communication mechanism or memory access protocol must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock a memory location.
- To maintain cache consistency — When one processor accesses data cached on another processor, it must not receive incorrect data. If it modifies data, all other processors that access that data must receive the modified data.
- To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes be observed externally in precisely the same order as programmed.
- [...]
The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.
(CPUs use some variant of MESI; Intel in practice uses MESIF, AMD in practice uses MOESI.)
The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't really a strictly formal definition of the memory model. But the section 8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations shows that loads aren't reordered with loads. Another section also shows that LoadStore reordering is forbidden. Acq_rel is basically blocking all reordering except StoreLoad, and that's what x86 does. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)
Related:
In general, most weaker memory HW models also only allow local reordering so barriers are still only local within a CPU core, just making (some part of) that core wait until some condition. (e.g. x86 mfence blocks later loads and stores from executing until the store buffer drains. Other ISAs also benefit from light-weight barriers for efficiency for stuff that x86 enforces between every memory operation, e.g. blocking LoadLoad and LoadStore reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/)
A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before becoming visible to all, allowing IRIW reordering. Note that mo_acq_rel
in C++ allows IRIW reordering; only seq_cst
forbids it. Most HW memory models are slightly stronger than ISO C++ and make it impossible, so all cores agree on the global order of stores.