Search code examples
c++multithreadingstdatomicpropagationmemory-visibility

Atomic operation propagation/visibility (atomic load vs atomic RMW load)


Context

I am writing a thread-safe protothread/coroutine library in C++, and I am using atomics to make task switching lock-free. I want it to be as performant as possible. I have a general understanding of atomics and lock-free programming, but I do not have enough expertise to optimise my code. I did a lot of researching, but it was hard to find answers to my specific problem: What is the propagation delay/visiblity for different atomic operations under different memory orders?

Current assumptions

I read that changes to memory are propagated from other threads, in such a way that they might become visible:

  1. in different orders to different observers,
  2. with some delay.

I am unsure as to whether this delayed visibility and inconsistent propagation applies only to non-atomic reads, or to atomic reads as well, potentially depending on what memory order is used. As I am developing on an x86 machine, I have no way of testing the behaviour on weakly ordered systems.

Do all atomic reads always read the latest values, regardless of the type of operation and the memory order used?

I am pretty sure that all read-modify-write (RMW) operations always read the most recent value written by any thread, regardless of the memory order used. The same seems to be true for sequentially consistent operations, but only if all other modifications to a variable are also sequentially consistent. Both are said to be slow, which is not good for my task. If not all atomic reads get the most recent value, then I will have to use RMW operations just for reading an atomic variable's latest value, or use atomic reads in a while loop, to my current understanding.

Does the propagation of writes (ignoring side effects) depend on the memory order and the atomic operation used?

(This question only matters if the answer to the previous question is that not all atomic reads always read the most recent value. Please read carefully, I do not ask about the visibility and propagation of side-effects here. I am merely concerned with the value of the atomic variable itself.) This would imply that depending on what operation is used to modify an atomic variable, it would be guaranteed that any following atomic read receives the most recent value of the variable. So I would have to choose between an operation guaranteed to always read the latest value, or use relaxed atomic reads, in tandem with this special write operation that guarantees instant visibility of the modification to other atomic operations.


Solution

  • After some discussion, here are my findings: First, let's define what an atomic variable's latest value means: In wall-clock time, the very latest write to an atomic variable, so, from an external observer's point of view. If there are multiple simultaneous last writes (i.e., on multiple cores during the same cycle), then it doesn't really matter which one of them is chosen.

    1. Atomic loads of any memory order have no guarantee that the latest value is read. This means that writes have to propagate before you can access them. This propagation may be out of order with respect to the order in which they were executed, as well as differing in order with respect to different observers.

      std::atomic_int counter = 0;
      
      void thread()
      {
          // Imagine no race between read and write.
          int value = counter.load(std::memory_order_relaxed);
          counter.store(value+1, std::memory_order_relaxed);
      }
      
      for(int i = 0; i < 1000; i++)
          std::async(thread);
      

      In this example, according to my understanding of the specs, even if no read-write executions were to interfere, there could still be multiple executions of thread that read the same values, so that in the end, counter would not be 1000. This is because when using normal reads, although threads are guaranteed to read modifications to the same variable in the correct order (they will not read a new value and on the next value read an older value), they are not guaranteed to read the globally latest written value to a variable.

      This creates the relativity effect (as in Einstein's physics) that every thread has its own "truth", and this is exactly why we need to use sequential consistency (or acquire/release) to restore causality: If we simply use relaxed loads, then we can even have broken causality and apparent time loops, which can happen because of instruction reordering in combination with out-of-order propagation. Memory ordering will ensure that those separate realities perceived by separate threads are at least causally consistent.

    2. Atomic read-modify-write (RMW) operations (such as exchange, compare_exchange, fetch_add,…) are guaranteed to operate on the latest value as defined above. This means that propagation of writes is forced, and results in one universal view on the memory (if all reads you make are from atomic variables using RMW operations), independent of threads. So, if you use atomic.compare_exchange_strong(value,value, std::memory_order_relaxed) or atomic.fetch_or(0, std::memory_order_relaxed), then you are guaranteed to perceive one global order of modification that encompasses all atomic variables. Note that this does not guarantee you any ordering or causality of non-RMW reads.

      std::atomic_int counter = 0;
      
      void thread()
      {
          // Imagine no race between read and write.
          int value = counter.fetch_or(0, std::memory_order_relaxed);
          counter.store(value+1, std::memory_order_relaxed);
      }
      
      for(int i = 0; i < 1000; i++)
          std::async(thread);
      

      In this example (again, under the assumption that none of the thread() executions interfere with each other), it seems to me that the spec forbids value to contain anything but the globally latest written value. So, counter would always be 1000 in the end.

    Now, when to use which kind of read?

    If you only need causality within each thread (there might still be different views on what happened in which order, but at least every single reader has a causally consistent view on the world), then atomic loads and acquire/release or sequential consistency suffice.

    But if you also need fresh reads (so that you must never read values other than the globally (across all threads) latest value), then you should use RMW operations for reading. Those alone do not create causality for non-atomic and non-RMW reads, but all RMW reads across all threads share the exact same view on the world, which is always up to date.

    So, to conclude: Use atomic loads if different world views are allowed, but if you need an objective reality, use RMWs to load.