c++concurrency language-lawyer atomic stdatomic

Are seq-cst fences exactly the same as acq-rel fences in absence of seq-cst loads?

I'm trying to understand the purpose of std::atomic_thread_fence(std::memory_order_seq_cst); fences, and how they're different from acq_rel fences.

So far my understanding is that the only difference is that seq-cst fences affect the global order of seq-cst operations ([atomics.order]/4). And said order can only be observed if you actually perform seq-cst loads.

So I'm thinking that if I have no seq-cst loads, then I can replace all my seq-cst fences with acq-rel fences without changing the behavior. Is that correct?

And if that's correct, why am I seeing code like this "implementation Dekker's algorithm with Fences", that uses seq-cst fences, while keeping all atomic reads/writes relaxed? Here's the code from that blog post:

std::atomic<bool> flag0(false),flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
 
    // critical section


    turn.store(1,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false,std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
 
    // critical section


    turn.store(0,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false,std::memory_order_relaxed);
}

Solution

As I understand it, they're not the same, and a counterexample is below. I believe the error in your logic is here:

And said order can only be observed if you actually perform seq-cst loads.

I don't think that's true. In atomics.order p4 which defines the axioms of the sequential consistency total order S, items 2-4 all may involve operations which are not seq_cst. You can observe the coherence ordering between such operations, and this can let you infer how the seq_cst operations have been ordered.

As an example, consider the following version of the StoreLoad litmus test, akin to Peterson's algorithm:

std::atomic<bool> a,b;  // initialized to false

void thr1() {
    a.store(true, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (b.load(std::memory_order_relaxed) == false)
        std::cout << "thr1 wins";
}

void thr2() {
    b.store(true, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (a.load(std::memory_order_relaxed) == false)
        std::cout << "thr2 wins";
}

Note all the loads are relaxed.

I claim that if thr1 prints "thr1 wins", then we deduce that a.store(true) preceded b.store(true) in the sequential consistency order S.

To see this, let A be b.load() and B be b.store(true). If b.load() == false then we have that A is coherence-ordered before B. (Apply atomics.order p3.3 with A,B as above, and X the initialization of b to false.)

Now let X be the fence in thr1. Then X happens before A by sequencing, so X precedes B in S by atomics.order p4.3. That is, the thr1 fence precedes b.store(true). And a.store(true), which is also seq_cst, precedes the thr1 fence, because the store strongly happens before the fence by sequencing. So by transitivity, a.store(true) precedes b.store(true), as claimed.

Similarly, if thr2 prints, then b.store(true) precedes a.store(true). They can't both precede each other, so we have therefore proved that the program cannot print both messages.

If you downgrade the fences to acq_rel, the proof breaks down. In that case, as far as I can see, nothing prevents the program from printing thr1 wins even if b.store(true) precedes a.store(true) in the order S. As such, with acq_rel fences, I believe it is allowed for both threads to print. Though I'm not sure whether there is any real implementation where it could actually happen.

We can get an even simpler example if we make all the loads and stores relaxed, so that the only seq_cst operations are the fences. Then we can use (4.4) instead to show that if b.load(relaxed) returns false, the thr1 fence precedes the thr2 fence, and vice versa if a.load() returns false. We therefore conclude, as before, that the program cannot print both messages.

However, if we keep the loads and stores relaxed and weaken the fences to acq_rel, it is then more clear that we have lost this guarantee. Indeed, with a little prodding, a similar example actually fails on x86, where the acq_rel fence is a no-op because the loads/stores are already acquire/release. So that's a clearer case where a seq_cst fence really achieves something that acq_rel does not.