c++multithreading atomic memory-barriers

Memory ordering with multiple releases and a single acquire

Consider the following code:

#include <atomic>
#include <thread>
#include <cassert>
#include <memory>

int i = 0;
std::atomic_int a{0};

int main()
{
    std::thread thr1{[]
    {
        i = 1; // A
        a.store(1, std::memory_order::release); // B
    }};
    std::thread thr2{[]
    {
        while (a.load(std::memory_order::relaxed) != 1); // C
        a.store(2, std::memory_order::release); // D
    }};
    std::thread thr3{[]
    {
        while (a.load(std::memory_order::acquire) != 2); // E
        assert(i == 1); // F
        
    }};
    thr1.join();
    thr2.join();
    thr3.join();
}

My assumption is that the assert may or may not fail and the behavior here is undefined.
Although we have the happens-before relationships such as A->B, C->D, D->E, E->F, we don't have such a relationship for B->C because of the relaxed load in C. On the other hand, https://en.cppreference.com/w/cpp/atomic/memory_order says that

All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.

But we can't say that B->D is a release sequence headed by B because there are no read-modify-write operations on a whatsoever, so this paragraph doesn't work here.

Am I right in my understanding?

Solution

You are correct. Either the assertion succeeds, or the behavior is undefined (though in practice, that means the assertion should fail).

There is an intuitive way to explain it and a formal way.

Intuition

In an acquire/release memory model, every thread views the memory changes done by other threads as a timeline, and observes some point on that timeline. Threads can disagree over the value that objects hold because they each see their own time point for other threads' modification timeline.

While thr3 can only reach assert(i == 1); once thr2 has completed all its work due to the busy-wait in thr3, at the point of the assertion, thr3 can still have an outdated view of thr1 (where i is set). This outdated view is older than the one thr2 has.

Note that thr1 has to complete all of its work before thr2 can do its work, and thr2 has to complete all of its work before thr3 can do its work. However, this restriction on the sequence of events doesn't restrict what memory operations become visible between threads. Even if thr2 did a.load(std::memory_order::acquire), this wouldn't impose that thr3 has to have a more recent view of thr1.

Formal explanation

i = 1; and assert(i == 1); are possibly a data race. Either the assertion succeeds, or the behavior is undefined.

The problem is that it is unspecified whether a.store(1, std::memory_order::release); synchronizes with a.load(std::memory_order::acquire). The condition for this is found in [atomics.order] p2

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

If the load takes its value from the store(1) at some point, then a happens before relationship is formed because i = 1 is sequenced before that store(1) and store(1) synchronizes with load() in thr3.

However, there is no guarantee that this happens ([intro.races] p14):

The value of an atomic object M, as determined by evaluation B, is the value stored by some unspecified side effect A that modifies M, where B does not happen before A.

A plausible sequence of events is that thr2 completes all of its work before thr3 has started. thr3 then loads its value for a from the side effect a.store(2, std::memory_order::release); // D, so it never takes its value from the store in thr1.

In that event, you have a data race (between i = 1 and i == 1), and the behavior is undefined.