The current C++0x draft states in sections 29.3.9 and 29.3.10 (pages 1111-1112) that in the following example:
// Thread 1
r1 = y.load(memory_order_relaxed);
x.store(1, memory_order_relaxed);
// Thread 2
r2 = x.load(memory_order_relaxed);
y.store(1, memory_order_relaxed);
the outcome r1 = r2 = 1 is possible, since the operations of each thread are relaxed and to unrelated addresses. Now my question is about the possible outcomes of the following (similar) example:
// Thread 1
r1 = y.load(memory_order_acquire);
x.store(1, memory_order_release);
// Thread 2
r2 = x.load(memory_order_acquire);
y.store(1, memory_order_release);
I think that in this case the outcome r1 = r2 = 1 is not possible. If it were possible, the load of y would have read the value written by the store to y, so the store to y would synchronize-with (and thus happen-before) the load of y. Similarly for x: the store to x would happen-before the load of x. But the load of y is sequenced before (and thus also happens-before) the store to x, and likewise the load of x is sequenced before the store to y. This creates a cyclic happens-before relation, which I think is not allowed.
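If we take time (or instruction sequence, if you like) to flow downward, just like reading code, then my understanding is that an acquire lets other memory accesses move downward past it but not upward, and a release lets other memory accesses move upward past it but not downward.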
In other words, if you have code like
acquire
// other stuff
release
then memory accesses may move from outside the acquire/release pair to the inside, but not the other way around (and they may not skip the acquire/release pair completely either).
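To make the acquire / other stuff / release shape concrete, here is a minimal sketch of a spinlock built on std::atomic_flag. The names SpinLock and locked_count are hypothetical, invented for this illustration. The point is only that the plain increment of counter sits between the acquire (lock) and the release (unlock) and cannot leak out of that region, so the increments from different threads do not race.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock: lock() is the acquire, unlock() is the release.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Acquire: accesses after this point may not move above it.
        while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() {
        // Release: accesses before this point may not move below it.
        flag_.clear(std::memory_order_release);
    }
};

// Increments a plain (non-atomic) counter from several threads, always
// between lock() and unlock(), and returns the final value.
long locked_count(int threads, int increments_per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();     // acquire
                ++counter;       // other stuff
                lock.unlock();   // release
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling locked_count(4, 2000) from main should return 8000, because every increment happens inside the acquire/release pair.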
With the relaxed semantics of the first example in your question, the compiler or the hardware can reorder the memory accesses so that the stores enter the memory system before the loads, which allows r1 = r2 = 1. With the acquire/release semantics of the second example, that reordering is prevented, so r1 = r2 = 1 is not possible.
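If you want to poke at this yourself, the second example can be wrapped in a small harness along these lines. This is only a sketch: count_forbidden_outcomes is a hypothetical helper name, and a run that reports zero does not prove anything by itself (starting fresh threads each iteration makes interesting interleavings rare). It merely should never report a nonzero count if the reasoning above is right.

```cpp
#include <atomic>
#include <thread>

// Runs the acquire/release example `iterations` times and returns how often
// the outcome r1 == 1 && r2 == 1 was observed.
// (count_forbidden_outcomes is a hypothetical name for this sketch.)
int count_forbidden_outcomes(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0;

        // Thread 1
        std::thread t1([&] {
            r1 = y.load(std::memory_order_acquire);
            x.store(1, std::memory_order_release);
        });
        // Thread 2
        std::thread t2([&] {
            r2 = x.load(std::memory_order_acquire);
            y.store(1, std::memory_order_release);
        });

        // join() synchronizes with thread completion, so reading r1 and r2
        // afterwards is race-free.
        t1.join();
        t2.join();

        if (r1 == 1 && r2 == 1) ++forbidden;
    }
    return forbidden;
}
```

Swapping all four orderings to std::memory_order_relaxed gives the first example, for which a nonzero count would be a permitted result, although you may well never observe it on a given machine.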