This time I use atomic_fetch_add. Here is how I can get ra1=1 and ra2=1: both threads execute a.fetch_add(1, memory_order_relaxed) while a=0, each write sits in that core's store buffer and isn't visible to the other core, so both threads end up with ra1=1 and ra2=1.
I can see how it prints 12, 21 and 22.
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "11"; done doesn't print 11 within 5 mins
#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a, b;
long ra1, ra2;

void foo(){
    a.fetch_add(1, memory_order_relaxed);
    ra1 = a.load(memory_order_relaxed);
}

void bar(){
    a.fetch_add(1, memory_order_relaxed);
    ra2 = a.load(memory_order_relaxed);
}

int main(){
    thread t[2]{ thread(foo), thread(bar) };
    t[0].join(); t[1].join();
    printf("%ld%ld\n", ra1, ra2); // This doesn't print 11 but it should
}
a.fetch_add is atomic; that's the whole point. There's no way for two separate fetch_adds to step on each other and only result in a single increment.
Implementations that let the store buffer break that would not be correct implementations, because ISO C++ requires the entire RMW to be one atomic operation, not an atomic load plus a separate atomic store.
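To see why the atomicity of the whole RMW matters, here is a sketch of the kind of split operation the standard does not allow fetch_add to decay into. The function names foo_split / bar_split are my own, just for illustration; with the split version both threads can load 0 and both can store 1, so 11 (and a final value of 1) becomes a legal outcome, unlike with fetch_add.

#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a;
long ra1, ra2;

void foo_split(){
    long tmp = a.load(memory_order_relaxed);   // both threads can read 0 here
    a.store(tmp + 1, memory_order_relaxed);    // ...and both can then store 1
    ra1 = a.load(memory_order_relaxed);
}

void bar_split(){
    long tmp = a.load(memory_order_relaxed);
    a.store(tmp + 1, memory_order_relaxed);
    ra2 = a.load(memory_order_relaxed);
}

int main(){
    thread t1(foo_split), t2(bar_split);
    t1.join(); t2.join();
    // unlike the fetch_add version, 11 (a=1) is now possible: an increment can be lost
    printf("%ld%ld (a=%ld)\n", ra1, ra2, a.load());
}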
(e.g. on x86, lock add [a], 1 is a full barrier because of how it has to be implemented: making sure the updated data is visible in L1d cache as part of executing; see Can num++ be atomic for 'int num'?
On some other implementations, e.g. AArch64 before ARMv8.1, it will compile to an LL/SC retry loop¹, where the Store-Conditional will fail if this core lost exclusive ownership of the cache line between the load and store.)
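For intuition, an LL/SC retry loop behaves roughly like a compare_exchange_weak loop at the C++ level: the store part only succeeds if no other core touched the value in between, otherwise the whole RMW is retried. This is a rough sketch of the idea, not compiler output, and fetch_add_via_cas is a made-up helper name:

#include <atomic>

// Roughly the guarantee fetch_add provides, expressed as a CAS retry loop.
// compare_exchange_weak may fail spuriously, much like a store-conditional
// failing when exclusive ownership of the cache line is lost.
long fetch_add_via_cas(std::atomic<long>& a, long inc) {
    long expected = a.load(std::memory_order_relaxed);      // like the load-linked
    while (!a.compare_exchange_weak(expected, expected + inc,
                                    std::memory_order_relaxed)) {
        // on failure, expected is reloaded with the current value; retry the RMW
    }
    return expected;   // old value, same as fetch_add returns
}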
Footnote 1: Actually current GCC will call the libatomic helper function if you omit -march=armv8.1-a or -mcpu=cortex-a76 or whatever, so it can still benefit via runtime CPU dispatching from using the new single-instruction atomics like ldadd w2, w0, [x0] instead of a retry loop, in the likely case of the code running on an ARMv8.1 CPU. https://godbolt.org/z/vhePM9h8a