According to this https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html,
a release store is implemented as MOV
(into memory) on x86 (including x86-64).
According to this http://en.cppreference.com/w/cpp/atomic/memory_order
memory_order_release:
A store operation with this memory order performs the release operation: no memory accesses in the current thread can be reordered after this store. This ensures that all writes in the current thread are visible in other threads that acquire the same atomic variable and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic.
I understand that when memory_order_release is used, all memory stores done previously should finish before this one.
int a;
a = 10;
std::atomic<int> b;
b.store(50, std::memory_order_release); // I can be sure that 'a' is already 10, so the processor can't reorder the stores to 'a' and 'b'
QUESTION: How is it possible that a bare MOV
instruction (without an explicit memory fence) is sufficient for this behaviour? How does MOV
tell the processor to finish all previous stores?
That does appear to be the mapping, at least in code compiled with the Intel compiler, where I see:
0000000000401100 <_Z5storeRSt6atomicIiE>:
401100: 48 89 fa mov %rdi,%rdx
401103: b8 32 00 00 00 mov $0x32,%eax
401108: 89 02 mov %eax,(%rdx)
40110a: c3 retq
40110b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000401110 <_Z4loadRSt6atomicIiE>:
401110: 48 89 f8 mov %rdi,%rax
401113: 8b 00 mov (%rax),%eax
401115: c3 retq
401116: 0f 1f 00 nopl (%rax)
401119: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
for the code:
#include <atomic>
#include <stdio.h>

void store( std::atomic<int> & b ) ;
int load( std::atomic<int> & b ) ;

int main()
{
    std::atomic<int> b ;
    store( b ) ;
    printf("%d\n", load( b ) ) ;
    return 0 ;
}

void store( std::atomic<int> & b )
{
    b.store(50, std::memory_order_release ) ;
}

int load( std::atomic<int> & b )
{
    int v = b.load( std::memory_order_acquire ) ;
    return v ;
}
The current Intel architecture documents, Volume 3 (System Programming Guide), do a nice job of explaining this. See:
8.2.2 Memory Ordering in P6 and More Recent Processor Families
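The full memory model is explained there. In particular, for ordinary write-back memory it states that writes to memory are not reordered with other writes (with a few exceptions such as non-temporal stores and string operations), and that writes are not reordered with older reads. So by the time a plain store becomes visible to another processor, every earlier store from the same thread is already visible too, which is what a release store needs. The compiler still has to avoid reordering the surrounding accesses itself, but no extra instruction is needed for that. I'd assume that Intel and the C++ standard folks worked together in detail to nail down the best mapping for each memory order operation that conforms to the memory model described in Volume 3, and that plain stores and loads were determined to be sufficient in these cases.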
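To make the guarantee concrete, here is a minimal sketch of the usual publish/consume pattern built on the release store and acquire load from the code above. The names payload, ready, producer, consumer and run_once are made up for illustration and are not from the original post. On x86-64 I'd expect both the store and the load to compile to plain MOV instructions, but that depends on the compiler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, chosen only for this illustration.
int payload = 0;               // ordinary, non-atomic data
std::atomic<int> ready{0};     // flag written with release semantics

void producer()
{
    payload = 10;                                  // plain store
    ready.store(50, std::memory_order_release);    // may not be reordered before the store above
}

int consumer()
{
    // Spin until the acquire load observes the value written by the release store.
    while (ready.load(std::memory_order_acquire) != 50) { }
    // The acquire load synchronized with the release store,
    // so the earlier write to payload is visible here.
    return payload;
}

// Runs one producer and one consumer thread and returns what the consumer read.
int run_once()
{
    payload = 0;
    ready.store(0, std::memory_order_relaxed);
    int seen = -1;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    reader.join();
    writer.join();
    return seen;
}
```

Calling run_once() from main (built with thread support, e.g. -pthread) should return 10 on any conforming implementation. The C++ memory model guarantees that, not x86 specifically; x86 just happens to provide the required ordering without a fence.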
Note that the fact that no special instructions are required for this ordered store on x86-64 doesn't mean that will be universally true. On PowerPC I'd expect to see something like an lwsync instruction along with the store, and on HP-UX (ia64) the compiler should be using an st4.rel instruction.