
Memory order around atomic operations


I want to build a good mental image of std::memory_order_seq_cst, std::memory_order_relaxed, and std::memory_order_acq_rel.

Imagine your code as a sequence of bricks. Each brick is an operation on some variable. There are only two atomic variables, x and y, in the code. Atomic operations on x are colored red, atomic operations on y are colored blue, and the remaining operations are uncolored (gray).

Sequential consistency std::memory_order_seq_cst is defined by the following constraints: (i) any colored brick always stays at its original "position"; (ii) gray bricks between any two colored bricks may be reshuffled by the compiler.
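To make the metaphor concrete, here is a minimal sketch of this setup (the plain variables a and b are just illustrative stand-ins for the gray bricks):

    #include <atomic>

    std::atomic<int> x{0}, y{0};   // the red and blue bricks
    int a = 0, b = 0;              // gray bricks (illustrative)

    void writer()
    {
        a = 1;                                    // gray brick
        x.store(1, std::memory_order_seq_cst);    // red brick
        b = 2;                                    // gray brick
        y.store(1, std::memory_order_seq_cst);    // blue brick
        // In the single total order of seq_cst operations, the store to x
        // precedes the store to y; the two atomic stores are never
        // reordered with each other.
    }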

The relaxed memory order std::memory_order_relaxed is almost the opposite of sequential consistency: the red (blue) bricks can be shifted around in any manner, as long as the order within the red (blue) sequence itself does not change.
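Continuing the sketch above with relaxed ordering (reusing the same x and y):

    void writer_relaxed()
    {
        x.store(1, std::memory_order_relaxed);
        x.store(2, std::memory_order_relaxed);   // every thread sees x go 0 -> 1 -> 2
        y.store(1, std::memory_order_relaxed);   // may become visible before either x store
    }

    void reader_relaxed()
    {
        int r1 = y.load(std::memory_order_relaxed);
        int r2 = x.load(std::memory_order_relaxed);
        // r1 == 1 && r2 == 0 is an allowed outcome: the blue brick "slid past"
        // the red ones, but the order within each color is preserved.
    }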

Things get murkier for me with the acquire-release order std::memory_order_acq_rel. I tend to think of acquire and release as both being relaxations of sequential consistency: any colored brick can be shifted around, but it should never collide with another colored brick. In other words, if the code says atomic operation B happens between A and C, then the compiler should never shuffle the code such that B executes before A or after C.

I made the following figures:

[Figures 1-3: brick diagrams]

Figure 1 shows the order of operations in code. Figure 2 is one possibility under sequential consistency. Figure 3 is one possibility under relaxed memory order.

My questions are:

  1. Did I correctly understand sequential consistency and the relaxed memory order?

  2. Is my visualization accurate? Is there any better visualization?

  3. How can the acquire-release order be depicted with a similar visualization?

I am currently not interested in std::memory_order_consume.

And yes, I have read https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync. But it's still hard for me to digest quickly. I am wondering if there is an even easier way to reload this knowledge into my brain after two years away from the subject.

Thanks a lot!


Solution

  • Orders weaker than seq_cst can produce effects not explainable in terms of a total order of all load and store operations across all threads.


    One major example is store-forwarding, where a thread's own loads see its stores before they become globally visible (nearly all ISAs can do this). Only seq_cst blocks StoreLoad reordering.

    See also Globally Invisible load instructions (and the difficulty of creating a global order of all ops that includes loads, not just stores.)

    x86's TSO (Total Store Order) memory model is like acq_rel but without IRIW reordering. Or to put it another way: program order plus a store buffer with store-forwarding. The store buffer alone introduces StoreLoad reordering; see Jeff Preshing's Memory Reordering Caught in the Act demo.

    x86 TSO does let you "linearize" the stores from all threads into a single stack of bricks like you're doing, but not loads. C++ orders weaker than seq_cst don't guarantee even that, although it is true on most ISAs.
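    A minimal sketch of the StoreLoad effect (the pattern in Preshing's "caught in the act" demo; the function and variable names are just illustrative): each thread stores its own flag, then loads the other one.

        #include <atomic>

        std::atomic<int> x{0}, y{0};
        int r1, r2;

        void thread1()
        {
            x.store(1, std::memory_order_release);
            r1 = y.load(std::memory_order_acquire);
        }

        void thread2()
        {
            y.store(1, std::memory_order_release);
            r2 = x.load(std::memory_order_acquire);
        }
        // r1 == 0 && r2 == 0 is allowed: each store can still be sitting in
        // its thread's store buffer when the other thread's load executes.
        // Making all four operations std::memory_order_seq_cst forbids that
        // outcome.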


    Another example is IRIW reordering, where different readers disagree about the order of two threads writing to different variables. (Only PowerPC and some GPUs do this in real life, making stores visible to SMT sibling cores before global visibility).

    ISO C++ allows this with any memory order weaker than seq_cst.
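    An IRIW litmus-test sketch (names illustrative): two writers to different variables, two readers that read them in opposite orders.

        #include <atomic>

        std::atomic<int> x{0}, y{0};

        void writer_x() { x.store(1, std::memory_order_release); }
        void writer_y() { y.store(1, std::memory_order_release); }

        void reader_1()
        {
            int a = x.load(std::memory_order_acquire);   // sees 1
            int b = y.load(std::memory_order_acquire);   // sees 0: x's write "came first"
        }

        void reader_2()
        {
            int c = y.load(std::memory_order_acquire);   // sees 1
            int d = x.load(std::memory_order_acquire);   // sees 0: y's write "came first"
        }
        // With acquire/release, both readers can observe these values in the
        // same execution, i.e. they disagree about which independent write
        // happened first.  With seq_cst on all six operations there is a
        // single total order and this outcome is forbidden.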


    https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and https://preshing.com/20120913/acquire-and-release-semantics/ are good building blocks for a mental model in terms of local reordering of operations that access a (cache-)coherent view of memory.

    The ISO C++ rules are written in terms of an acquire load that sees a value from a release store, creating a happens-before relationship that guarantees visibility of everything sequenced before the store. So in the formalism itself, a release store doesn't mean anything unless there is an acquire load that sees it; but in practice compilers don't try to prove anything globally about synchronization. They compile atomic operations so that they work correctly for any hypothetical observer in a data-race-free program.
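    For completeness, the canonical release/acquire message-passing sketch that the happens-before rule describes (payload and ready are illustrative names):

        #include <atomic>

        int payload = 0;                  // ordinary, non-atomic data
        std::atomic<bool> ready{false};

        void producer()
        {
            payload = 42;                                   // plain store
            ready.store(true, std::memory_order_release);   // release store
        }

        void consumer()
        {
            while (!ready.load(std::memory_order_acquire))  // acquire load
                ;                                           // spin until the flag is seen
            // The acquire load saw the value written by the release store,
            // so the store to payload happens-before this point: reading it
            // is data-race free and guaranteed to see 42.
            int v = payload;
            (void)v;
        }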