Tags: c++, multithreading, atomic, memory-barriers

Force an atomic store to occur before other operations


If I have 2 threads that execute f() and f2(), like this:

std::atomic_flag flag; // a member variable in a class
std::mutex m;

flag.test_and_set(); // set to true before f() or f2()

// executed only once (a "stop" function)
void f() {
    flag.clear(std::memory_order_relaxed);

    {
        std::unique_lock<std::mutex> lock(m);
        // do something 
    }
    // join f2's thread
}

// long running function
void f2() {
    std::unique_lock<std::mutex> lock(m);

    while (flag.test(std::memory_order_relaxed)) {
        // do stuff
    }
}
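Assembled into a self-contained sketch, the scenario looks like this. (`std::atomic<bool>` stands in for `std::atomic_flag` so it compiles before C++20, since `atomic_flag::test` is a C++20 addition; the `run_demo` name and the 1ms head start are illustrative.)

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// std::atomic<bool> stands in for std::atomic_flag so this compiles
// before C++20 (atomic_flag::test is a C++20 addition).
std::atomic<bool> flag{false};
std::mutex m;
long iterations = 0; // written by f2 under the lock, read after join

void f2() {                       // long running function
    std::unique_lock<std::mutex> lock(m);
    while (flag.load(std::memory_order_relaxed)) {
        ++iterations;             // "do stuff"
    }
}

void f() {                        // the "stop" function
    flag.store(false, std::memory_order_relaxed);
    std::unique_lock<std::mutex> lock(m);
    // do something
}

long run_demo() {
    flag.store(true);             // set before f() or f2() runs
    std::thread t(f2);
    // give f2 a head start; the 1ms figure is arbitrary
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    f();
    t.join();
    return iterations;
}
```

Note that `f()` blocks on the mutex until `f2()` observes the cleared flag and releases it, which is exactly why the ordering question matters.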

f2() is generally executed first, and f() is supposed to stop f2().

It's important that flag.clear() executes before the mutex is acquired, or else f() wouldn't return (in my actual case it would just take a much longer time to finish).

So my question: Is relaxed memory ordering sufficient to guarantee that flag.clear() executes before locking the mutex?

In this question, a similar scenario is presented. From what I can understand, it's safe to use a relaxed memory order there because there are no dependencies on the flag other than joining the thread, and std::thread's join is a synchronizing operation. Therefore flag.clear() couldn't be moved after join.

However, here I use a std::mutex. But if I, say, switch to another implementation like this one: https://rigtorp.se/spinlock/, would it still be safe to use a relaxed memory order? Because their lock uses only memory_order_acquire, isn't it technically possible for the locking of the mutex to be ordered entirely before flag.clear()? Acquire only prevents operations after it from moving before the acquire, but it says nothing about stores moving after the acquire.
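For context, a minimal test-and-set spinlock in the spirit of the linked rigtorp implementation looks roughly like this (a sketch, not a drop-in copy). Its lock() uses only acquire ordering and its unlock() only release, so nothing in it says anything about stores sequenced before lock() being made visible first:

```cpp
#include <atomic>

// Minimal test-and-set spinlock sketch. lock() uses only acquire
// ordering; unlock() uses release. Nothing here orders earlier
// *stores* against acquiring the lock.
struct Spinlock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // acquire: later operations can't move before a successful lock
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // spin on a relaxed load to avoid cache-line ping-pong
            while (locked.load(std::memory_order_relaxed)) { }
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```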

Would I need something like a StoreLoad barrier, as described here: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/? This question seems to suggest that I need a full memory barrier as well.

Or alternatively, if the flag.clear() store could somehow have "acquire" memory ordering, it would prevent operations after it from being reordered before the store. But this comment suggests that you need sequentially consistent ordering instead:

If you really need to prevent something from being reordered before a store, AFAIK your only recourse is to use seq_cst on both the store and the other operation. Nothing weaker suffices in the C++ memory model.
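For reference, the belt-and-braces version that the quoted advice amounts to would look roughly like this, using a seq_cst store plus a full fence so the clear cannot be reordered past the lock under any reading. (Names here are illustrative, and the answer below argues this is not actually needed for this program.)

```cpp
#include <atomic>
#include <mutex>

std::atomic_flag stop_flag = ATOMIC_FLAG_INIT;
std::mutex mtx;

// Illustrative only: seq_cst store plus a full fence. The fence acts
// as a StoreLoad barrier, so the clear is globally visible before any
// later operation, including acquiring the mutex.
void f_with_full_barrier() {
    stop_flag.clear(std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
    std::unique_lock<std::mutex> lock(mtx);
    // do something
}
```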


Solution

  • This question doesn't actually have anything to do with memory barriers. It's really a question of forward progress that is addressed by [intro.progress p18]:

    An implementation should(*) ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

So once f() has been called and the flag.clear() line has been reached, that value will be observed in finite time, come hell or high water. It doesn't matter what code comes next, be it lock() or recv() or wait_for_godot(). As such, the load in f2() will eventually return false, and thus the compiler is forbidden from performing any transformation that could prevent that from taking place.

    Note that this applies regardless of what std::memory_order is used, and doesn't need any fences or anything else.

    One might think this implies that an atomic load or store can't be reordered with a mutex lock, nor with any other operation that can't be proved to always finish in finite time. But the situation is a little more subtle than that: read on.
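The finite-time rule can be seen in a minimal illustration (names are my own): a reader spinning on a relaxed load is guaranteed to see a relaxed store eventually, with no fences involved:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> done{false};

// The reader spins on a relaxed load; [intro.progress p18] guarantees
// the writer's relaxed store becomes visible in finite time, so the
// loop (and therefore join()) terminates without any fences or
// stronger memory orderings.
bool run_progress_demo() {
    done.store(false);
    std::thread reader([] {
        while (!done.load(std::memory_order_relaxed)) { }
    });
    done.store(true, std::memory_order_relaxed); // visible in finite time
    reader.join();
    return done.load();
}
```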


    On the other hand, it is still true that a mutex lock is an acquire barrier only. In general, it is allowed for operations sequenced before the lock to become visible after the lock is acquired, provided that the finite-time rule is still obeyed.

For instance, suppose the machine knows that the store to flag is going to take a long time (maybe it missed cache). Let's say 10us. It would then be allowed to "schedule" the store to execute when ready, and go on to start trying to lock the mutex. If it should happen that the mutex were successfully acquired in less than 10us, it would be fine for the machine to acquire it and start executing the // do something. In this case, the something could become visible before the store to flag. The key, however, is that if the mutex is not successfully acquired within 10us, then the machine promises to make the store visible anyway.

    Now in your particular program, due to its logic, it is not actually possible for the mutex to be acquired before the store to flag is visible to f2. So the machine is free to start trying the mutex before 10us is up, and it can have the intention of immediately going on to // do something if successful, but it so happens that this will not come to pass.

    So, we could say that a compiler is only allowed to unconditionally reorder an atomic load or store with another operation, if it can prove that the other operation must complete in finite time. But there can still be reordering if it's "conditional" as in the above example: "if the lock is acquired within 10us, then reorder it with the store, otherwise don't".

    As a more concrete way to say it, the following is a legitimate way for the compiler to rewrite f():

    if (m.try_lock()) {
        // do something
        flag.clear();
    } else {
        flag.clear();
        m.lock();
        // do something
    }
    m.unlock();
    

    (Assuming that do something itself always completes in finite time, and doesn't contain any release barriers.)

    In this version, there is a scenario where the mutex has been taken, but the flag has not yet been cleared. (In your program, because of what the other thread is doing, the if branch is not actually reachable; but in some other context it could be.) That's the sense in which they can be reordered, while still satisfying the finite time rule.

    But the following, which is what you are worried about, is forbidden:

    m.lock();
    flag.clear();
    // ...
    

    (*) Some people get worried because this says "should" instead of "shall" or "must". So technically, obeying [intro.progress p18] is a matter of quality-of-implementation rather than conformance. However, an implementation that doesn't do it will be so broken that, whether or not you consider it "conformant", it will be unusable.

    There are a few instances of this kind of thing in the standard, another example being [atomics.order p8], the "no out-of-thin-air values" rule, which is also worded as "should". I think the issue is that they each specify a requirement in a somewhat informal way, "you obviously know what we mean here", rather than with mathematical formality. If they said "must", it could be a problem for some vendor trying to formally verify that their implementation conforms, because the requirement itself is not stated precisely enough to make that possible.