Consider the following toy example, especially the `result` function:
```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

class Worker
{
    std::thread th;
    std::atomic_bool done = false;
    int value = 0;
public:
    Worker()
        : th([&]
          {
              std::this_thread::sleep_for(std::chrono::seconds(1));
              value = 42;
              done.store(true, std::memory_order_release);
          }) {}

    int result() const
    {
        return done.load(std::memory_order_acquire) ? value : -1;
    }

    Worker(const Worker &) = delete;
    Worker &operator=(const Worker &) = delete;

    ~Worker()
    {
        th.join();
    }
};

int main()
{
    Worker w;
    while (true)
    {
        int r = w.result();
        if (r != -1)
        {
            std::cout << r << '\n';
            break;
        }
    }
}
```
I reckon that I need acquire semantics only if `done.load()` returns `true`, so I could rewrite it like this:
```cpp
int result() const
{
    if (done.load(std::memory_order_relaxed))
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        return value;
    }
    else
    {
        return -1;
    }
}
```
It seems to be a legal thing to do, but I lack the experience to tell whether this change makes sense (i.e., whether it's actually better optimized or not).
Which of the two forms should I prefer?
If most checks of `done` find it not yet done, and this happens in a throughput-sensitive part of your program, then yes, this could make sense, even on ISAs where a separate barrier costs more. Perhaps a use-case like an exit-now flag that also signals some data or a pointer a thread will want: you check often, but the great majority of the time you don't exit and don't need later operations to wait for this load to complete.
This is a win on some ISAs (where a `load(acquire)` is already a load + barrier), but on others it's usually worse, especially if the case we care about most (the "fast path") is the one that loads `value`. (That's the situation on ISAs where a `fence(acquire)` is more expensive than a `load(acquire)`, notably 32-bit ARM with the ARMv8 new instructions: `lda` is just an acquire load, but a fence is still a `dmb ish` full barrier.)
If the `!done` case is common and there's other work to do, then the tradeoff may be worth considering, since `std::memory_order_consume` is not currently usable for its intended purpose. (See below re: memory dependency ordering solving this specific case without any barrier.)
For other common ISAs, no, it wouldn't make sense because it would make the "success" case slower, maybe much slower if it ended up with a full barrier. If that's the normal fast-path through the function, that would obviously be terrible.
On x86 there's no difference: `fence(acquire)` is a no-op, and `load(acquire)` uses the same asm as `load(relaxed)`. That's why we say x86's hardware memory model is "strongly ordered". Most other mainstream ISAs aren't like this.
For some ISAs this is a pure win: those that implement `done.load(acquire)` as a plain load followed by the same barrier instruction a `fence(acquire)` would use (like RISC-V, or 32-bit ARM without ARMv8 instructions). They have to branch anyway, so it's just a question of where we place the barrier relative to the branch. (Unless they choose to unconditionally load `value` and branchlessly select, like MIPS `movn`, which is allowed because they already load another member of the same `class Worker` object, so it's known to be a valid pointer to a full object.)
AArch64 can do acquire loads quite cheaply, but an acquire barrier would be more expensive. (And it would happen on what would normally be the fast path; speeding up the "failure" path is normally not important.)
Instead of a barrier, a second load, this time with acquire, could possibly be better. If the flag can only change from 0 to 1, you don't even need to re-check its value; accesses to the same atomic object are ordered within the same thread.
(I had a Godbolt link with some examples for many ISAs, but a browser restart ate it.)
Unfortunately `std::memory_order_consume` is temporarily deprecated, otherwise you could have the best of both worlds for this case, by creating an `&value` pointer with a data dependency on `done.load(consume)`. Then the load of `value` (if done at all) would be dependency-ordered after the load from `done`, but other independent later loads wouldn't have to wait.
e.g. `if ((tmp = done.load(consume)))` and `return (&value)[tmp - 1]`. This is easy in asm, but without fully working `consume` support, compilers would optimize out the use of `tmp` in the side of the branch that can only be reached with `tmp == true`.
So the only ISA that actually needs to make this barrier tradeoff in asm is Alpha, but due to C++ limitations we can't easily take advantage of the hardware support that other ISAs offer.
If you're willing to use something that will work in practice despite not having guarantees, use `std::atomic<int *> done = nullptr;` and do a release-store of `&value` instead of `true`. Then in the reader, do a `relaxed` load, and `if (tmp) { return *tmp; } else { return -1; }`. If the compiler can't prove that the only non-null pointer value is `&value`, it will need to keep the data dependency on the pointer load. (To stop it from proving that, perhaps include a `set` member function that stores an arbitrary pointer in `done`, which you never call.)
See *C++11: the difference between memory_order_relaxed and memory_order_consume* for details, and a link to Paul E. McKenney's CppCon 2016 talk where he explains what `consume` was supposed to be for, and how Linux RCU does use the kind of thing I suggested: effectively relaxed loads, depending on the compiler to make asm with data dependencies. (Which requires being careful not to write things where it can optimize away the data dependency.)