I read this post.
The answer interestingly points out:
You do in fact need to modify your code to not use C library functions on volatile buffers. Your options include:
- Write your own alternative to the C library function that works with volatile buffers.
- Use a proper memory barrier.
I am curious how #2 is possible. Let's say two (single-threaded) processes use shm_open() + mmap() to create/open the same shared memory on CentOS 7, and memcpy() to copy data into it.
I am using gcc/g++ 7 on x86-64.
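To make the scenario concrete, here is a rough single-process sketch of the setup I mean (the name /demo_shm and the 4096-byte size are arbitrary; on CentOS 7's older glibc you also need to link with -lrt). In the real scenario the write and the read happen in two separate processes:

```cpp
#include <fcntl.h>      // O_CREAT, O_RDWR
#include <sys/mman.h>   // shm_open, shm_unlink, mmap, munmap
#include <unistd.h>     // ftruncate, close
#include <cstring>      // memcpy

// Create the segment, copy a value in, read it back, clean up.
// Returns the value read back, or -1 on any failure.
int shm_roundtrip() {
    const char *name = "/demo_shm";                   // arbitrary name
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) return -1;
    if (ftruncate(fd, 4096) != 0) { close(fd); return -1; }

    void *p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    int value = 42;
    std::memcpy(p, &value, sizeof value);             // the "writer" process
    int readback = 0;
    std::memcpy(&readback, p, sizeof readback);       // the "reader" process

    munmap(p, 4096);
    close(fd);
    shm_unlink(name);
    return readback;
}
```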
Roll your own compiler memory barrier, to tell the compiler that all global variables may have been asynchronously modified.
In C++11 and later, the language defines a memory model which specifies that data races on non-atomic variables are undefined behaviour. So although this still works in practice on modern compilers, we should probably only talk about C++03 and earlier. Before C++11, you had to roll your own, or use pthreads library functions or whatever other library.
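For reference, the C++11 way to write this is with std::atomic rather than hand-rolled barriers. A minimal sketch of the publish/consume pattern (plain globals stand in for the shared-memory locations here; an atomic placed in actual shared memory must be lock-free to work across processes):

```cpp
#include <atomic>

// C++11: make the flag atomic instead of relying on compiler barriers.
std::atomic<int> ready{0};
int payload = 0;

void writer() {
    payload = 42;                              // plain (non-atomic) data
    ready.store(1, std::memory_order_release); // publish: payload write can't sink below this
}

int reader() {
    while (ready.load(std::memory_order_acquire) == 0) {} // wait for the flag
    return payload;                            // safe to read after the acquire load
}
```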
Related: How does a mutex lock and unlock functions prevents CPU reordering?
In GNU C, asm("" ::: "memory") is a compiler memory barrier. On x86, a strongly-ordered architecture, this alone gives you acq_rel semantics, because the only kind of runtime reordering x86 can do is StoreLoad. The optimizer treats it exactly like a call to a non-inline function: any memory that anything outside this function could have a pointer to is assumed to be modified. See Understanding volatile asm vs volatile variable. (A GNU C extended asm statement with no outputs is implicitly volatile, so asm volatile("" ::: "memory") is more explicit but equivalent.)
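A minimal illustration of the effect (GNU C only): without the barrier the compiler may fold the two loads into one; with the barrier it must assume *p changed and load it again.

```cpp
// The "memory" clobber makes the compiler forget everything it knows
// about memory contents at this point.
int read_twice(int *p) {
    int a = *p;                     // first load
    asm volatile("" ::: "memory");  // compiler barrier: *p may have changed
    int b = *p;                     // compiled as a second load, not reused from 'a'
    return a + b;
}
```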
See also http://preshing.com/20120625/memory-ordering-at-compile-time/ for more about compiler barriers. But note that this isn't just blocking reordering, it's blocking optimizations like keeping the value in a register in a loop.
e.g. a spin loop like while(shared_var) {} can compile to if(shared_var) infinite_loop;, but with a barrier we can prevent that:
void spinwait(int *ptr_to_shmem) {
    while (*ptr_to_shmem) {
        asm("" ::: "memory");
    }
}
gcc -O3 for x86-64 (on the Godbolt compiler explorer) compiles this to asm that looks like the source, without hoisting the load out of the loop:
# gcc's output
spinwait(int*):
jmp .L5 # gcc doesn't check or know that the asm statement is empty
.L3:
#APP
# 3 "/tmp/compiler-explorer-compiler118610-54-z1284x.occil/example.cpp" 1
#asm comment: barrier here
# 0 "" 2
#NO_APP
.L5:
mov eax, DWORD PTR [rdi]
test eax, eax
jne .L3
ret
The asm statement is still a volatile asm statement which has to run exactly as many times as the loop body runs in the C abstract machine. GCC jumps over the empty asm statement to reach the condition at the bottom of the loop, to make sure the condition is checked before running the (empty) asm statement. I put an asm comment in the asm template to see where it ends up in the compiler-generated asm for the whole function. We could have avoided this by writing a do{}while() loop in the C source. (Why are loops always compiled into "do...while" style (tail jump)?).
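That do{}while() version would look like this; the barrier then sits naturally at the bottom of the loop, so the compiler doesn't need the extra jmp into the middle. (It runs the empty asm statement once even when the flag is already 0, which is harmless here.)

```cpp
// do{}while version of the same spin-wait: barrier at the loop bottom.
void spinwait_dowhile(int *ptr_to_shmem) {
    do {
        asm("" ::: "memory");   // compiler barrier, empty asm
    } while (*ptr_to_shmem);
}
```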
Other than that, it's the same as the asm we get from using std::atomic_int or volatile. (See the Godbolt link).
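For comparison, the same spin-wait written with std::atomic (portable C++11) and with volatile (the pre-C++11 idiom); on x86 both compile to essentially the same load/test/branch loop as the asm-barrier version:

```cpp
#include <atomic>

// Portable C++11: the load itself is atomic and can't be hoisted.
void spinwait_atomic(std::atomic<int> *p) {
    while (p->load(std::memory_order_acquire)) {}
}

// Pre-C++11 idiom: volatile forces a fresh load each iteration
// (works in practice, but is not a data-race-free construct in C++11).
void spinwait_volatile(volatile int *p) {
    while (*p) {}
}
```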
Without the barrier, it does hoist the load:
# clang6.0 -O3
spinwait_nobarrier(int*): # @spinwait_nobarrier(int*)
cmp dword ptr [rdi], 0
je .LBB1_2
.LBB1_1: #infinite loop
jmp .LBB1_1
.LBB1_2: # jump target for 0 on entry
ret
Without anything compiler-specific, you could actually use a non-inline function to defeat the optimizer, but you might have to put it in a library to defeat link-time optimization; merely putting it in another source file is not sufficient once LTO can see across files. So you end up needing a system-specific Makefile or similar. (And the call has runtime overhead.)
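A sketch of that approach (two files shown in one listing for brevity; opaque_barrier is a made-up name, and for the trick to actually work it must be compiled separately, ideally into a library, so the optimizer can't see that it's empty):

```cpp
// --- barrier.cpp (compile separately; don't let LTO see into it) ---
extern "C" void opaque_barrier(void) {}  // empty, but callers can't know that

// --- main.cpp ---
extern "C" void opaque_barrier(void);

// The opaque call forces the compiler to assume all reachable memory
// may have changed, so *ptr_to_shmem is reloaded each iteration.
void spinwait_portable(int *ptr_to_shmem) {
    while (*ptr_to_shmem) {
        opaque_barrier();
    }
}
```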