Search code examples
c++multithreadingwinapiatomiclock-free

Is a memory barrier required to read a value that is atomically modified?


Given the following:

class Foo
{
public:
    void Increment()
    {
        _InterlockedIncrement(&m_value); // OSIncrementAtomic
    }

    long GetValue()
    {
        return m_value;
    }

private:
    long m_value;
};

Is a memory barrier required for the read of m_value? My understanding is that _InterlockedIncrement will generate a full memory barrier, and that ensures the value is stored before any subsequent loads occur. So that sounds safe from that aspect, but, can m_value be cached at all, i.e. can GetValue() return a stale value, even when incrementing atomically?

Jeff Preshing's excellent article(s) for reference: https://preshing.com/20120515/memory-reordering-caught-in-the-act/

Additional context: I'm following a series of articles on lock-free programming, and in particular, looking at the usage of the unfinishedJobs variable and the potential implementation of HasJobCompleted here: https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/

void Wait(const Job* job)
{
  // wait until the job has completed. in the meantime, work on any other job.
  while (!HasJobCompleted(job))
  {
    Job* nextJob = GetJob();
    if (nextJob)
    {
      Execute(nextJob);
    }
  }
}

Determining whether a job has completed can be done by comparing unfinishedJobs with 0.

So, given that context, would a possible implementation of HasJobCompleted require a memory barrier?


Solution

  • No, you don't need barriers, but your code is broken anyway if readers and writers call these functions in different threads. Especially if a reader calls the read function in a loop.

    TL:DR: use C++11 std::atomic<long> m_value with return m_value++ in the increment and return m_value in the reader. That will give you sequential consistency in a data-race-free program: execution will be work as if threads ran with some interleaving of source order. (Unless you violate the rules and have other non-atomic shared data.) You definitely want to return a value from Increment, if you want threads doing increments to ever know what value they produced. Doing a separate load after is totally broken for use cases like int sequence_num = shared_counter++; where another thread's increment could be visible between count++; tmp = count;.

    If you don't need such strong ordering with respect to operations on other objects in the same thread as the reader/writer, return m_value.load(std::memory_order_acquire) is enough for most uses, and m_value.fetch_add(1, std::memory_order_acq_rel). Very few programs actually need StoreLoad barriers anywhere; atomic RMWs can't actually reorder very much even with acq_rel. (On x86, those will both compile the same as if you'd used seq_cst.)

    You can't force ordering between threads; the load either sees the value or it doesn't, depending on whether the reading thread saw the invalidate from the writer before or after it took / tried to take a value for the load. The whole point of threads is that they don't run in lock-step with each other.


    Data-race UB:

    A loop reading m_value can hoist the load out of the loop since it's not atomic (or even volatile as a hack). This is data-race UB, compilers will break your reader. See this and Multithreading program stuck in optimized mode but runs normally in -O0

    Barriers aren't the problem/solution here, just forcing re-checking of memory (or the cache-coherent view of memory that the current CPU sees; actual CPU caches like L1d and L2 are not a problem for this). That's not what barriers really do; they order this thread's access to coherent cache. C++ threads only run across cores with coherent caches.

    But seriously don't roll your own atomics without a very compelling reason. When to use volatile with multi threading? pretty much never. That answer explains cache coherency and that you don't need barriers to avoid seeing stale values.

    In many real-world C++ implementations, something like std::atomic_thread_fence() will also be a "compiler barrier" that forces the compiler to reload even non-atomic vars from memory, even without volatile, but that's an implementation detail. So it may happen to work well enough, on some compilers for some ISAs. And still isn't fully safe against the compiler inventing multiple loads; See the LWN article Who's afraid of a big bad optimizing compiler? for examples with details; primarily aimed at how the Linux kernel rolls its own atomics with volatile, which is de-facto supported by GCC/clang.


    "latest value"

    update: See also Does atomic read guarantee reading of the latest value? for my more recent thoughts on why "latest value" is the wrong way to think about things for either correctness or performance (inter-thread latency), and what you can expect from real compilers on real CPUs.)

    Note that [atomics.order] p10 is often mis-quoted as RMWs reading the "latest value" (implying some kind of absolute sense or freshness guarantee), but it's actually the latest value in the mod order before the value they store. This just guarantees they're atomic, nothing more nothing less. See Is the order of a side effect in the modification order determined by when the side effect is produced? - later stores to the object can already be visible to loads in other threads but not the one doing the RMW. (Plus, on real CPUs the modification order of a variable isn't established for some time after stores are visible in their own thread, and potentially to some others; in the abstract machine it makes the most sense to look at the mod order as including all past and future stores for the lifetime of the object.)


    Beginners often panic over "latest value" concerns, and think that RMW operations are somehow better because of the way they're specified. Since they're a read + write tied together, and there is a modification order for every memory location separately, RMW operations necessarily have to wait for write access to a cache line, and that means serializing all writes and RMWs on a single location.

    Plain loads of atomic variables are still guaranteed (by real implementations) to see values promptly. (ISO C++ only suggests that values should be seen in finite time, and promptly, but of course real implementations can do much better because they run on cache-coherent CPU hardware.)

    There's no such thing as "immediate" between two threads; either a load in another thread sees a value stored, or it ran before the store became visible to other threads and didn't. With thread scheduling and so on, it's always possible that a thread will load a value but then not use it for a long time; it was fresh when it was loaded.

    So this is pretty much irrelevant for correctness, and all that's left is worrying about inter-thread latency. That could in some cases be helped by barriers (to reduce contention from later memory operations, not from actively flushing your stores faster, barriers just wait for that to happen the normal way). So that's usually a very minor effect, and not a reason to use extra barriers.

    See MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?. And see my comments on https://github.com/dotnet/runtime/issues/67330#issuecomment-1083539281 and Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? Often no, and if it does, not by much.

    Certainly not enough to be worth slowing down the reader with lots of extra barriers just to make it look at this atomic variable later than other atomic variables, if you didn't need that ordering for correctness. Or slowing down the writer to make it sit there doing nothing to maybe let it complete an RFO a few cycles sooner instead of getting other useful work done.

    If your use of threading is so bottlenecked on inter-core latency that it was worth it, that's probably a sign you need to rethink your design.

    Without barriers or ordering, just std::atomic with memory_order_relaxed, you'll still normally see data on other cores within maybe 40 nanoseconds (on a modern x86 desktop/laptop), about the same as if both threads were using atomic RMWs. And it's not possible for it to be delayed for any significant amount of time, like a microsecond maybe if you create lots of contention for lots of earlier stores so they all take a long time to commit before this one can. You definitely don't have to worry about going a long time with stores not being visible. (This of course only applies to atomic or hand-rolled atomics with volatile. Plain non-volatile loads may only check once at the start of a loop, and then never again. That's why they're unusable to multithreading.)