This time I use atomic_fetch_add. Here is how I can get ra1=1 and ra2=1: both threads execute a.fetch_add(1, memory_order_relaxed) while a=0, each write sits in that core's store buffer and isn't visible to the other core, so both threads end up with ra1=1 and ra2=1.
I can see how it prints 12, 21 and 22.
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "11"; done doesn't print 11 within 5 mins
#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a, b;
long ra1, ra2;

void foo(){
    a.fetch_add(1, memory_order_relaxed);
    ra1 = a.load(memory_order_relaxed);
}

void bar(){
    a.fetch_add(1, memory_order_relaxed);
    ra2 = a.load(memory_order_relaxed);
}

int main(){
    thread t[2]{ thread(foo), thread(bar) };
    t[0].join(); t[1].join();
    printf("%ld%ld\n", ra1, ra2); // This doesn't print 11 but it should
}
a.fetch_add is atomic; that's the whole point. There's no way for two separate fetch_adds to step on each other and only result in a single increment.
Implementations that let the store buffer break that would not be correct implementations, because ISO C++ requires the entire RMW to be one atomic operation, not an atomic load plus a separate atomic store.
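To see why the atomicity of the whole RMW matters, here is a sketch of the kind of split operation the standard does not allow fetch_add to decay into. The function names foo_split / bar_split are my own, just for illustration; with the split version both threads can load 0 and both can store 1, so 11 (and a final value of 1) becomes a legal outcome, unlike with fetch_add.

#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a;
long ra1, ra2;

void foo_split(){
    long tmp = a.load(memory_order_relaxed);   // both threads can read 0 here
    a.store(tmp + 1, memory_order_relaxed);    // ...and both can then store 1
    ra1 = a.load(memory_order_relaxed);
}

void bar_split(){
    long tmp = a.load(memory_order_relaxed);
    a.store(tmp + 1, memory_order_relaxed);
    ra2 = a.load(memory_order_relaxed);
}

int main(){
    thread t1(foo_split), t2(bar_split);
    t1.join(); t2.join();
    // unlike the fetch_add version, 11 (a=1) is now possible: an increment can be lost
    printf("%ld%ld (a=%ld)\n", ra1, ra2, a.load());
}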
(e.g. on x86, lock add [a], 1 is a full barrier because of how it has to be implemented: making sure the updated data is visible in L1d cache as part of executing; see Can num++ be atomic for 'int num'?
On some other implementations, e.g. AArch64 before ARMv8.1, it will compile to an LL/SC retry loop¹, where the Store-Conditional will fail if this core lost exclusive ownership of the cache line between the load and store.)
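For intuition, an LL/SC retry loop behaves roughly like a compare_exchange_weak loop at the C++ level: the store part only succeeds if no other core touched the value in between, otherwise the whole RMW is retried. This is a rough sketch of the idea, not compiler output, and fetch_add_via_cas is a made-up helper name:

#include <atomic>

// Roughly the guarantee fetch_add provides, expressed as a CAS retry loop.
// compare_exchange_weak may fail spuriously, much like a store-conditional
// failing when exclusive ownership of the cache line is lost.
long fetch_add_via_cas(std::atomic<long>& a, long inc) {
    long expected = a.load(std::memory_order_relaxed);      // like the load-linked
    while (!a.compare_exchange_weak(expected, expected + inc,
                                    std::memory_order_relaxed)) {
        // on failure, expected is reloaded with the current value; retry the RMW
    }
    return expected;   // old value, same as fetch_add returns
}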
Footnote 1: Actually current GCC will call the libatomic helper function if you omit -march=armv8.1-a or -mcpu=cortex-a76 or whatever, so it can still benefit via runtime CPU dispatching from using the new single-instruction atomics like ldadd w2, w0, [x0] instead of a retry loop, in the likely case of the code running on an ARMv8.1 CPU. https://godbolt.org/z/vhePM9h8a