c++ concurrency memory-barriers stdatomic

atomic exchange with memory_order_acquire and memory_order_release

I have a situation where I would like to prepare some data in one thread:

// My boolean flag
std::atomic<bool> is_data_ready{false};

Thread 1 (producer thread):
  PrepareData();
  if (!is_data_ready.exchange(true, std::memory_order_release)) {
    NotifyConsumerThread();
  }
  else {
    return;
  }

In consumer thread,

Thread 2:
  if (is_data_ready.exchange(false, std::memory_order_acquire)) {
    ProcessData();
  }

Does it make sense to use acquire/release order (instead of acq_rel order) for exchange? I am not sure if I understand it correctly: does std::memory_order_release in exchange mean the store is a release store? If so, what is the memory order for the load?

Solution

An atomic RMW has a load part and a store part. memory_order_release gives the store side release semantics while leaving the load side relaxed. The reverse holds for exchange(val, acquire): the load side is an acquire load and the store side is relaxed. With exchange(val, acq_rel) or seq_cst, the load is an acquire load and the store is a release store.

(compare_exchange_weak/_strong can have one memory order for the pure-load case where the compare failed, and a separate memory order for the RMW case where it succeeds. This distinction is meaningful on some ISAs, but not on ones like x86 where it's just a single instruction that effectively always stores, even in the false case.)
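As a minimal sketch of the two-order form of compare_exchange_strong (the helper name try_claim is hypothetical, not something from the question):

```cpp
#include <atomic>

// Hypothetical helper: try to change the flag from false to true.
// If the compare succeeds, the operation is an RMW with acq_rel ordering.
// If it fails, it is only a load, and that load uses relaxed ordering.
bool try_claim(std::atomic<bool>& flag) {
    bool expected = false;
    return flag.compare_exchange_strong(expected, true,
                                        std::memory_order_acq_rel,   // success: RMW
                                        std::memory_order_relaxed);  // failure: pure load
}
```

The first call on a false flag returns true and leaves the flag set; a second call returns false and leaves the flag unchanged.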

And of course atomicity of the exchange (or any other RMW) is guaranteed regardless of anything else; no stores or RMWs to this object by other cores can come between the load and store parts of the exchange. Notice that I didn't mention pure loads, or operations on other objects. See later in this answer, and also "For purposes of ordering, is atomic read-modify-write one operation or two?"

Yes, this looks sensible, although simplistic and maybe racy in allowing more stuff to be published after the first batch is consumed (or has started being consumed)¹. But for the purposes of understanding how atomic RMWs work, and the ordering of their load and store sides, we can ignore that.

exchange(true, release) "publishes" some shared data stored by PrepareData(), and checks the old value to see if the worker thread needs to get notified.

And in the reader, is_data_ready.exchange(false, acquire) is a load that syncs with the release-store if there was one, creating a happens-before relationship that makes it safe to read that data without data-race UB. Tied to that (as part of the same atomic RMW), the store of false lets other threads see that the consumer has gone past the point of checking for new work, so it needs another notify if there is any.
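Here is a minimal runnable sketch of that publish/consume pairing. The names shared_payload, produce, consume and run_once are hypothetical stand-ins (for PrepareData / ProcessData and a test driver), and since the question doesn't show the notification mechanism, the consumer here simply spins on the exchange instead of sleeping until notified. That is an assumption for the sake of a self-contained example, not a recommendation.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> is_data_ready{false};
int shared_payload = 0;  // plain non-atomic data, standing in for what PrepareData() writes

// Producer: write the payload, then publish it with a release exchange.
// Returns true if the old flag value was false, i.e. a notify would be needed.
bool produce(int value) {
    shared_payload = value;  // stands in for PrepareData()
    return !is_data_ready.exchange(true, std::memory_order_release);
}

// Consumer: retry the acquire exchange until it returns true.
// (Assumption: spinning replaces the notification the question leaves out.)
int consume() {
    while (!is_data_ready.exchange(false, std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // The acquire load read the value written by the release store,
    // so reading the plain payload here is not a data race.
    return shared_payload;  // stands in for ProcessData()
}

// Hypothetical driver: run one producer and one consumer, return what the consumer saw.
int run_once(int value) {
    int seen = 0;
    std::thread consumer([&seen] { seen = consume(); });
    std::thread producer([value] { produce(value); });
    producer.join();
    consumer.join();
    return seen;
}
```

After run_once(42), the consumer has seen 42 and has left is_data_ready reset to false, ready for the next batch.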

Yes, exchange(value, release) means the store part of the RMW has release ordering wrt. other operations in the same thread. The load part is relaxed, but the load/store pair still form an atomic RMW. So the load can't take a value until this core has exclusive ownership of the cache line.

Or in C++ terms, it sees the "latest value" in the modification order of is_data_ready; if some other thread was also storing to is_data_ready, that store will happen either before the load (before the whole exchange), or after the store (after the whole exchange).

Note that a pure load in another core coming after the load part of this exchange is indistinguishable from coming before, so only operations that involve a store are part of the modification order of an object. (That modification order is guaranteed to exist such that all threads can agree on it, even when you're using relaxed loads/stores.)

But the load part of another atomic RMW will have to come before the load part of the exchange, otherwise that other RMW would have this exchange happening between its load and its store. That would violate the atomicity guarantee of the other RMW, so that can't happen. Atomic RMWs on the same object effectively serialize across threads. That's why a million fetch_add(1, mo_relaxed) operations on an atomic counter will increment it by 1 million, regardless of what order they end up running in. (See also C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads re: why atomic RMWs have to work this way.)
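A small sketch of the counter case, assuming a hypothetical helper count_with_relaxed_rmw that spawns the threads and returns the final value:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each do per_thread relaxed increments.
// Relaxed ordering is enough for the count itself, because atomicity of each
// RMW does not depend on the memory order.
long count_with_relaxed_rmw(int num_threads, long per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (long i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // join synchronizes, so the final load sees every increment
    }
    return counter.load(std::memory_order_relaxed);
}
```

With 4 threads doing 250000 increments each, the result is exactly 1000000 no matter how the threads interleave.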

C++ is specified in terms of syncs-with and whether a happens-before guarantee exists that allows your other loads to see other stores by other threads. But humans often like to think in terms of local reordering (within execution of one thread) of operations that access shared memory (via coherent cache).

In terms of a memory-reordering model, the store part of an exchange(val, release) can reorder with later operations other than release or seq_cst ones (note that unlocking a mutex counts as a release operation), but not with any earlier operations. This is what acquire and release semantics are all about, as Jeff Preshing explains: https://preshing.com/20120913/acquire-and-release-semantics/.
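To make the one-way nature of that concrete, here is an annotated sketch. The variable names (before_value, after_value, flag) and the function publish are hypothetical. Run single-threaded, it can't demonstrate any reordering; the comments only mark which reorderings the release ordering rules out and which it leaves allowed.

```cpp
#include <atomic>

std::atomic<int>  before_value{0};
std::atomic<int>  after_value{0};
std::atomic<bool> flag{false};

// Returns the old value of flag.
bool publish() {
    // Earlier operation: may NOT be reordered after the store part
    // of the release exchange below.
    before_value.store(1, std::memory_order_relaxed);

    bool old = flag.exchange(true, std::memory_order_release);

    // Later relaxed operation: release ordering does not stop this store
    // from becoming visible to other threads before the exchange's store.
    after_value.store(1, std::memory_order_relaxed);
    return old;
}
```

A reader that sees flag == true with an acquire load is therefore guaranteed to see before_value == 1, but a reader that sees after_value == 1 can't conclude anything about flag.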

Wherever the store ends up, the load is at some point before it. Right before it in the modification order of is_data_ready, but operations on other objects by this thread (especially in other cache lines) may be able to happen in between the load and store parts of an atomic exchange.

In practice, some CPU architectures don't make that possible. Notably, x86 atomic RMW operations are always full barriers: they wait for all earlier loads and stores to complete before the exchange, and don't start any later loads and stores until after it. So not even StoreLoad reordering of the store part of an exchange with later loads is possible on x86.

But on AArch64 you can observe StoreLoad reordering of the store part of a seq_cst exchange with a later relaxed load. But only the store part, not the load part; being seq_cst means the load part of the exchange has acquire semantics and thus happens before any later loads. See "For purposes of ordering, is atomic read-modify-write one operation or two?"

Footnote 1: is this a usable producer/consumer sync algorithm?

With a single boolean flag (not a queue with a read-index / write-index), IDK how a producer would know when it can overwrite the shared variables that the consumer will look at. If it (or another producer thread) did that right away after seeing is_data_ready == false, you'd race with the reader that's just started reading.

If you can solve that problem, this does appear to avoid the possibility of the consumer missing an update and going to sleep, as long as it handles the case where a second writer adds more data and sends a notify before the consumer finishes ProcessData. (The writers only know that the consumer has started, not when it finishes.) I guess this example isn't showing the notification mechanism, which might itself create synchronization.

If two producers run PrepareData() at overlapping times, only the first one to finish will send a notification, not both. The exception is if the consumer does an exchange and resets is_data_ready between the two exchanges in the producers; then it will get a second notification. (So that sounds pretty hard to deal with in the consumer, and in whatever data structure PrepareData() manages, unless it's something like a lock-free queue itself, in which case just check the queue for work instead of this mechanism. But again, this is still a usable example to talk about how exchange works.)

If a consumer is frequently checking and finding no work needing doing, that's also extra contention that could have been avoided if it checked read-only until it saw a true, and only then exchanged it to false (with an acquire exchange). But since you're worrying about notifications, I assume it's not a spin-wait loop, instead sleeping if there isn't work to do.
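A sketch of that check-before-exchange idea, assuming a hypothetical helper try_take_work on the consumer side:

```cpp
#include <atomic>

// Hypothetical consumer-side check: do a pure load first, and only do the
// RMW (which needs exclusive ownership of the cache line) when the flag
// looks set.
bool try_take_work(std::atomic<bool>& ready) {
    if (!ready.load(std::memory_order_relaxed)) {
        return false;  // nothing published: no store, no RMW
    }
    // The acquire exchange is what synchronizes with the producer's release
    // store; its return value, not the earlier relaxed load, decides the result.
    return ready.exchange(false, std::memory_order_acquire);
}
```

With a single consumer, the exchange returns true whenever the load saw true, since nobody else resets the flag. With several consumers it could still return false, which is why the result of the exchange is what gets returned.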