c++memory-model stdatomic instruction-reordering

memory model, how load acquire semantic actually works?

From very nice Paper and article about memory reordering.

Q1: I understand that cache-coherence, store buffer and invalidation queue is root cause of memory reordering ?

Store release is quite understandable, have to wait for all load and store are completed before set flag to true.

About load acquire, typical use of atomic load is waiting for a flag. Suppose we have 2 threads:

int x = 0;
std::atomic<bool> ready_flag = false;

// thread-1
if(ready_flag.load(std::memory_order_relaxed))
{
    // (1)
    // load x here
}
// (2)
// load x here

// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);

EDIT: in thread-1, it should be a while loop, but I copied the logic from article above. So, assume memory-reorder is occurred just in time.

Q2: Because (1) and (2) depends on if condition, CPU have to wait for ready_flag, does it mean write-release is enough ? How memory-reordering can happens with this context ?

Q3: Obviously we have load-acquire, so I guess mem-reorder is possible, then where should we place the fence, (1) or (2) ?

Solution

Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically, with no chance for any CPU operation to interrupt the access such that no data races can occur with regard to accessing that value (it can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.

As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:

Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.
Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.

To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.
The writing statement must set the flag using a releasing memory order that must be at least as powerful as memory_order_release.