Tags: c++, concurrency, atomic, memory-barriers, memory-model

Can loads slip beneath an acquire operation / can stores float above a release in C++?


TL;DR: is it true that only one (and not two) of the four possible reorderings is allowed around acquire/release operations? If so, why?

From what I understand so far, acquire-release semantics (basically) mean that

  • an acquire operation doesn't allow loads/stores beneath it to float above it
  • a release operation doesn't allow loads/stores above it to slip beneath it

But less is said about the reverse directions.
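To make sure we're talking about the same thing: the two bullets above are what makes the usual message-passing pattern work. A minimal sketch (the names are mine):

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;  // plain, non-atomic data

void producer() {
    payload = 42;                                  // A: plain store
    ready.store(true, std::memory_order_release);  // B: release store
    // The release forbids A from slipping beneath B.
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) // C: acquire load
        ;                                          // spin until B is visible
    // The acquire forbids the read of payload from floating above C,
    // so once C observes true, payload is guaranteed to read 42.
    return payload;
}

int run_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    std::thread t(producer);
    int result = consumer();
    t.join();
    return result;
}
```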

From some sources (Jeff Preshing's blog and others, and some architecture manuals seem to imply this) I read that an acquire operation is equivalent to an atomic op followed by a memory barrier, and a release operation to a memory barrier followed by an atomic op, on a given memory location.

They describe four kinds of memory barriers and say that, e.g., an acquire operation uses a barrier like LoadLoad + LoadStore (and similarly for release).

As I understand those barriers (LoadLoad + LoadStore for acquire, and StoreStore + LoadStore for release), they only allow:

  • a store to slip beneath an acquire

  • a load to float above a release

And a load cannot slip beneath an acquire / a store cannot float above a release.
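In other words, as far as I can tell the only motions those barrier combinations permit look like this (a sketch; whether a compiler or CPU actually performs these reorderings is implementation-dependent):

```cpp
#include <atomic>

std::atomic<int> a{0};
std::atomic<int> g{0};

int acquire_side() {
    g.store(1, std::memory_order_relaxed);      // a store ABOVE the acquire...
    int v = a.load(std::memory_order_acquire);  // ...may slip beneath this load
    return v;  // (the acquire does not constrain operations that precede it)
}

void release_side() {
    a.store(2, std::memory_order_release);      // release store
    int v = g.load(std::memory_order_relaxed);  // a load BELOW the release...
    // ...may float above the store (the release does not constrain
    // operations that follow it)
    (void)v;
}
```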

Is that generally correct? Is that correct for C++? Is it different for C++ vs the general meaning?

(Because e.g. this answer says that a load can slip beneath an acquire (as I understand it). I also had a couple of sources that said anything can slip beneath an acquire (and vice-versa).)

If that is correct, what's the rationale for that? I was trying to come up with something like (for release):

x.store(5, std::memory_order_release);
y.store(true, std::memory_order_relaxed);

A different thread reading them in the opposite order would be a bad thing, considering this is used in patterns like double-checked locking.

Is that close to a reason? If so could someone give solid examples for both acquire and release?

While for a store slipping beneath an acquire / a load floating above a release there are (probably) no such drawbacks...


Solution

  • Memory barriers can be used to implement load-acquire and store-release semantics, but they provide guarantees that are more strict than required, as stated in Jeff Preshing's article:

    Please note that these barriers are technically more strict than what’s required for acquire and release semantics on a single memory operation, but they do achieve the desired effect.

    If you place a LoadLoad + LoadStore barrier between the load-acquire and subsequent memory operations, then all loads prior to the barrier in program order cannot be reordered after the barrier, and all later memory accesses cannot be reordered before the barrier. This is more strict than necessary to implement acquire semantics for a specific load operation, because the barrier orders all previous loads, not just the specific load that needs acquire semantics. So they are not exactly equivalent. The same goes for store-release semantics. Herb Sutter wrote a comment regarding that:

    Yes, this is a bug in my presentation (the words more than the actual slide). The example is fine but I should fix the description of "if this was a release fence." In particular:

    starting at 1:10:30, I was incorrect to say that a release fence has a correctness problem because it allows stores to float up (it does not, as noted the rule is in 29.8.2; thanks!) – what I should have said was that it’s still a performance pessimization because the fence is not associated with THAT intended store, but since we don’t know which following store it has to pessimistically apply to ALL ensuing ordinary stores until the next applicable special memory operation synchronization point – it pushes them all down and often doesn’t need to

    The reason that load-acquire and store-release semantics are implemented in terms of LoadLoad, LoadStore, and StoreStore barriers is that ISAs only provide such barriers. There are research proposals for more flexible or configurable barriers that apply only to specific memory operations or to a range or block of instructions, but they have not yet made their way into any ISA.
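    To illustrate the fence-vs-operation distinction above, here is a sketch contrasting a per-operation acquire load with the relaxed-load-plus-acquire-fence form (the names are illustrative):

    ```cpp
    #include <atomic>

    std::atomic<bool> flag{false};
    int data = 0;

    // Per-operation form: only this one load carries acquire semantics.
    int with_acquire_load() {
        if (flag.load(std::memory_order_acquire))
            return data;  // ordered after the acquire load
        return -1;
    }

    // Fence form: a relaxed load followed by an acquire fence. The fence
    // orders ALL prior loads against ALL later memory operations, which is
    // stronger (and potentially slower) than the per-operation form above.
    int with_acquire_fence() {
        if (flag.load(std::memory_order_relaxed)) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return data;
        }
        return -1;
    }
    ```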