Consider the code below:
std::atomic<int> a = 100;
---
CPU 0:
a.store(101, std::memory_order_relaxed);
---
CPU 1:
int tmp = a.load(std::memory_order_relaxed); // Assume `tmp` is 101.
Let's assume that CPU 0's store to `a` happens to occur earlier in time than CPU 1's load of `a` (whether or not the load is reordered). Thus, in this scenario, `tmp` will be 101 instead of 100.
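For concreteness, here is a complete, runnable arrangement of those snippets (my own sketch, using std::thread to stand in for "CPU 0" and "CPU 1"; whether `tmp` ends up 100 or 101 depends on timing, and the scenario above assumes the store wins the race):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> a{100};

int main() {
    std::thread cpu0([] {
        a.store(101, std::memory_order_relaxed);      // "CPU 0"
    });
    std::thread cpu1([] {
        int tmp = a.load(std::memory_order_relaxed);  // "CPU 1": reads 100 or 101
        std::printf("tmp = %d\n", tmp);
    });
    cpu0.join();
    cpu1.join();
}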
If the MOESI coherence protocol is used, then when CPU 0 stores to `a`, CPU 0 acquires the cache line in Modified (M) state. The store goes into CPU 0's store buffer. If CPU 1 had the cache line in its own cache, then its copy of the cache line transitions to Invalid (I) state.
When CPU 1 loads `a`, the cache line transitions to Shared (S) state (or perhaps Owned (O) state).
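Here is a toy model of the state transitions I have in mind (my own simplification for illustration, not anything from a real design): the writer's store invalidates other copies and takes the line to M, and a later read by the other core downgrades the writer to O while giving the reader an S copy.

#include <array>
#include <cstdio>

enum class LineState { Modified, Owned, Exclusive, Shared, Invalid };

const char* name(LineState s) {
    switch (s) {
        case LineState::Modified:  return "M";
        case LineState::Owned:     return "O";
        case LineState::Exclusive: return "E";
        case LineState::Shared:    return "S";
        case LineState::Invalid:   return "I";
    }
    return "?";
}

// One cache line, tracked per core (2 cores in this toy model).
std::array<LineState, 2> line = {LineState::Invalid, LineState::Invalid};

// Core `w` commits a store: it needs exclusive ownership, so every other
// copy is invalidated and the writer's copy becomes Modified.
void commit_store(int w) {
    for (int c = 0; c < 2; ++c)
        if (c != w) line[c] = LineState::Invalid;
    line[w] = LineState::Modified;
}

// Core `r` loads the line: a core holding it Modified supplies the dirty
// data and drops to Owned; the reader ends up with a Shared copy.
void load(int r) {
    for (int c = 0; c < 2; ++c)
        if (c != r && line[c] == LineState::Modified)
            line[c] = LineState::Owned;
    if (line[r] == LineState::Invalid)
        line[r] = LineState::Shared;
}

int main() {
    commit_store(0);  // CPU 0 stores 101: CPU0=M, CPU1=I
    load(1);          // CPU 1 loads:      CPU0=O, CPU1=S
    std::printf("CPU0=%s CPU1=%s\n", name(line[0]), name(line[1]));
}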
Assume that the store to `a` is still in CPU 0's store buffer when CPU 1 loads `a`. Given that CPU 1 cannot read CPU 0's store buffer, when CPU 1 reads the cache line containing `a`, does this imply that CPU 0's store buffer is flushed (or at least that the entry for `a` is flushed from CPU 0's store buffer)?
If the flush did not happen, this would imply that CPU 0 and CPU 1 both have the cache line in Shared (S) state, but CPU 0 sees `a` with a value of 101 while CPU 1 sees `a` with a value of 100.
Note: I am asking about MOESI specifically, even though each microarchitecture implements its own coherence protocol. I would imagine this concern is handled similarly in most microarchitectures, though.
Store buffers aren't snooped by loads from other cores; they're private. Stores become globally visible when they commit from the store buffer to L1d cache. (The core has to get MESI exclusive ownership of the line before it can do that, E or M state.)
This has to wait until after the store instruction has graduated, aka retired from the ROB (ReOrder Buffer), so it's known to be non-speculative. A store buffer is necessary to allow speculative execution of stores, confining to this core the speculative state that might need to be rolled back if mis-speculation is detected (e.g. a branch mispredict or a fault in an earlier instruction).
A core can see its own stores (via store forwarding) before they become globally visible (to any other cores). This "reordering" is somewhat separate from the usual StoreLoad reordering introduced by a store buffer when later loads are to different addresses. See also Globally Invisible load instructions for some discussion of it. (And fun corner cases like a load that partially overlaps with a store seeing a value that no other core could ever have seen.)
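As a concrete sketch of that distinction (names and arrangement are mine), here is the classic store-buffer litmus test: each thread can read its own just-written variable via store forwarding, but its load of the other variable can complete before its own store leaves the store buffer, so r1 == 0 && r2 == 0 is an allowed outcome even on x86.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        int own = x.load(std::memory_order_relaxed); // always 1: forwarded from t1's own store buffer
        (void)own;
        r1 = y.load(std::memory_order_relaxed);      // may read 0: x=1 may not be globally visible yet
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);      // may read 0 for the same reason
    });
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);            // r1==0 && r2==0 is allowed (StoreLoad reordering)
}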
x86's TSO memory model is program order with a store buffer + store forwarding¹ for each core's accesses to coherent shared cache. (See Preshing's analogy, Memory Barriers Are Like Source Control Operations.) It's important to mention store-forwarding, because it can produce effects you wouldn't see if a load that "hit" an address already in the store buffer just stalled until the store buffer committed to cache.
A cache line has to be exclusively owned before the store can commit to L1d (and become globally visible), but store forwarding to this core's own loads can happen without that.
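A toy model of that rule (my own illustration, nothing like how a real core is built): a store sits in a private buffer, a same-core load is forwarded from it with no ownership required, and only commit, which first invalidates other copies, makes the value visible to another core. Until then the other core simply keeps reading the old, still-coherent value; no flush of the writer's store buffer is forced.

#include <array>
#include <cstdio>
#include <optional>

struct Core {
    std::optional<int> store_buffer;      // at most one pending store to the line, private to this core
    bool owns_line_exclusively = false;
};

int cache_value = 100;                    // the globally visible value of the line
std::array<Core, 2> core{};

void buffered_store(int c, int v) {       // store executes: only this core can see v (by forwarding)
    core[c].store_buffer = v;
}

int load(int c) {
    if (core[c].store_buffer)             // store forwarding: no ownership needed
        return *core[c].store_buffer;
    return cache_value;                   // otherwise read the coherent (possibly old) value
}

void commit(int c) {                      // the store leaves the buffer and becomes globally visible
    for (Core& other : core)
        other.owns_line_exclusively = false;  // other copies are invalidated first
    core[c].owns_line_exclusively = true;     // exclusive ownership (E/M) obtained
    cache_value = *core[c].store_buffer;
    core[c].store_buffer.reset();
}

int main() {
    buffered_store(0, 101);
    std::printf("CPU0 sees %d (forwarded)\n", load(0));     // 101
    std::printf("CPU1 sees %d (no flush forced)\n", load(1)); // 100
    commit(0);
    std::printf("CPU1 sees %d (after commit)\n", load(1));  // 101
}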
(On most architectures, commit to L1d and MESI coherency is the only way for a store to become visible outside the current core at all. But PowerPC allows forwarding "graduated" stores to the other logical SMT cores, making IRIW reordering possible.)
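For reference, the IRIW litmus test looks like this in C++ (my arrangement; with acquire loads as below, the two readers are allowed to disagree about the order of the two independent stores, and on PowerPC that outcome is observable in practice, while seq_cst everywhere would forbid it):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

int main() {
    std::thread w1([] { x.store(1, std::memory_order_release); });
    std::thread w2([] { y.store(1, std::memory_order_release); });
    std::thread readerA([] {
        r1 = x.load(std::memory_order_acquire);
        r2 = y.load(std::memory_order_acquire);
    });
    std::thread readerB([] {
        r3 = y.load(std::memory_order_acquire);
        r4 = x.load(std::memory_order_acquire);
    });
    w1.join(); w2.join(); readerA.join(); readerB.join();
    // r1==1 && r2==0 && r3==1 && r4==0 means the two readers saw the
    // independent writes in opposite orders: the IRIW reordering above.
    std::printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
}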
Footnote 1: This is what 486 or P5 Pentium "naturally" did, with in-order pipelines and a store buffer, before an x86 memory model was really documented. P6 took pains not to introduce any new memory-reordering to avoid breaking existing multi-threaded code. It speculatively loads early, but rolls back with a memory-order mis-speculation pipeline nuke if it detects that the cache line has been invalidated between when it actually loaded and when it's architecturally allowed to load.