With reference to the following code
#include <atomic>
#include <cstdint>

auto x = std::atomic<std::uint64_t>{0};
auto y = std::atomic<std::uint64_t>{0};
// thread 1
x.store(1, std::memory_order_release);
auto one = y.load(std::memory_order_seq_cst);
// thread 2
y.fetch_add(1, std::memory_order_seq_cst);
auto two = x.load(std::memory_order_seq_cst);
Is it possible here that one and two are both 0?
(I seem to be hitting a bug that could be explained if one and two could both hold the value 0 after the code above runs. And the rules for ordering are too complicated for me to figure out what orderings are possible above.)
Yes, it's possible for both loads to get 0.
Within thread 1, y.load can "pass" x.store(mo_release) because they're not both seq_cst. The global total order of seq_cst operations that ISO C++ guarantees must exist only includes seq_cst operations.
(In hardware terms, on a normal CPU, the load can take a value from coherent cache before the release-store has left the store buffer. In this case I found it much easier to reason in terms of how I know it compiles for x86 (or to generic release and acquire operations), and then apply the asm memory-ordering rules. Applying this reasoning assumes that the normal C++-to-asm mappings are safe, i.e. always at least as strong as the C++ memory model. If you can find a legal reordering this way, you don't need to wade through the C++ formalism. But if you can't, that of course doesn't prove it's safe in the C++ abstract machine.)
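If you want to try to see this empirically, here's a minimal litmus-test harness (my own sketch, not from the question) that re-runs the two threads and counts the 0,0 outcomes. Whether you actually observe it depends on the hardware, the compiler, and timing; not observing it proves nothing, of course.

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

int main() {
    long both_zero = 0;
    const long iters = 200000;
    for (long i = 0; i < iters; ++i) {
        std::atomic<std::uint64_t> x{0}, y{0};
        std::atomic<bool> go{false};          // crude start flag so the threads overlap more often
        std::uint64_t r1 = 99, r2 = 99;

        std::thread t1([&] {
            while (!go.load(std::memory_order_acquire)) {} // spin until both threads are released
            x.store(1, std::memory_order_release);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            while (!go.load(std::memory_order_acquire)) {}
            y.fetch_add(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        go.store(true, std::memory_order_release);
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0)
            ++both_zero;
    }
    std::printf("r1 == 0 && r2 == 0 in %ld of %ld runs\n", both_zero, iters);
}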
Anyway, the key point to realize is that a seq_cst operation isn't like atomic_thread_fence(mo_seq_cst). Individual seq_cst operations only have to recover/maintain sequential consistency in the way they interact with other seq_cst operations, not with plain acquire/release/acq_rel. (Similarly, acquire and release fences are stronger two-way barriers, unlike acquire and release operations, as Jeff Preshing explains.)
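To make that distinction concrete, here's a sketch (mine, not part of the question) of two ways thread 1 could be changed if you actually wanted to forbid the 0,0 outcome. Either an actual seq_cst fence between the store and the load, or promoting the store itself to seq_cst, is enough here, because thread 2's operations are already all seq_cst.

// Thread 1, variant A: a real two-way barrier between the store and the load
x.store(1, std::memory_order_release);
std::atomic_thread_fence(std::memory_order_seq_cst); // unlike the seq_cst load by itself, the fence orders the earlier store before the later load
auto r1 = y.load(std::memory_order_seq_cst);

// Thread 1, variant B: promote the store to seq_cst, so all four operations are in the single total order
x.store(1, std::memory_order_seq_cst);
auto r1 = y.load(std::memory_order_seq_cst);

On x86 both variants end up with a full barrier (an mfence or a locked instruction) between the store and the load, which is exactly what's missing in the original code.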
In the original code, that store/load reordering within thread 1 is the only reordering possible; the other possibilities are just interleavings of the program order of the two threads. Having the store "happen" (become visible) last leads to the 0, 0 result.
I renamed one and two to r1 and r2 (local "registers" within each thread), to avoid writing things like one == 0.
// the x=1 store nominally executes first in T1, but doesn't have to drain the store buffer before later loads
auto r1 = y.load(std::memory_order_seq_cst); // T1b r1 = 0 (y)
y.fetch_add(1, std::memory_order_seq_cst); // T2a y = 1 becomes globally visible
auto r2 = x.load(std::memory_order_seq_cst); // T2b r2 = 0 (x)
x.store(1, std::memory_order_release);       // T1a x = 1 eventually becomes globally visible
This can happen in practice on x86, but interestingly not on AArch64. x86 can do a release-store without additional barriers (it's just a normal store), and a seq_cst load compiles the same as a plain acquire load, also just a normal load.
On AArch64, release and seq_cst stores use STLR. seq_cst loads use LDAR, which has a special interaction with STLR, not being allowed to read cache until the last STLR drains from the store buffer. So release-store / seq_cst load on ARMv8 is the same as seq_cst store / seq_cst load. (ARMv8.3 added LDAPR, allowing true acquire / release by letting acquire loads compile differently; see this Q&A.)
However, it can also happen on many ISAs that use separate barrier instructions, like ARM32: a release store is typically done with a barrier and then a plain store, preventing reordering with earlier loads/stores but not stopping reordering with later ones. If the seq_cst load doesn't need a full barrier before itself (which is the normal case), the store can reorder after the load.
For example, a release store on ARMv7 is dmb ish; str, and a seq_cst load is ldr; dmb ish, so you have str / ldr with no barrier between them.
On PowerPC, a seq_cst load is hwsync; ld; cmp; bc; isync, so there's a full barrier before the load, which blocks this store/load reordering there. (The HeavyWeight Sync is, I think, partly there to prevent IRIW reordering, by blocking store-forwarding between SMT threads on the same physical core, so stores from other cores are only seen when they actually become globally visible.)
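If you want to check these mappings for yourself, a pair of tiny functions like the following (the names are mine, purely for illustration) can be fed to a compiler for each target, e.g. on the Godbolt compiler explorer, to see what the release store and the seq_cst load turn into. I'm using a 32-bit type here so the 32-bit ARM output stays a plain str/ldr rather than an exclusive-pair loop for 64-bit atomics.

#include <atomic>
#include <cstdint>

// Helpers purely for inspecting code-gen on different targets.
void release_store(std::atomic<std::uint32_t>& x) {
    x.store(1, std::memory_order_release);     // x86: plain mov; AArch64: stlr; ARMv7: dmb ish + str
}

std::uint32_t seq_cst_load(std::atomic<std::uint32_t>& x) {
    return x.load(std::memory_order_seq_cst);  // x86: plain mov; AArch64: ldar; ARMv7: ldr + dmb ish
}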