Tags: c++, c++11, atomic, memory-fences

C++0x memory_order without fences: applications and chips that support it


As a follow-up to my previous question: the atomic<T> class specifies most operations with a memory_order parameter. In contrast to a fence, this memory order affects only the atomic on which it operates. Presumably, by using several such atomics you can build a concurrent algorithm where the ordering of other memory is unimportant.

So I have two questions:

  1. Can somebody point me to an example of an algorithm/situation that would benefit from the ordering of individual atomic variables and not require fences?
  2. Which modern processors support this type of behavior? That is, where the compiler won't just translate the specific order into a normal fence.

Solution

  • The memory ordering parameter on operations on std::atomic<T> variables does not affect the ordering of that operation per se; it affects the ordering relationships that the operation creates with other operations.

    e.g. a.store(std::memory_order_release) on its own tells you nothing about how operations on a are ordered with respect to anything else, but paired with a call to a.load(std::memory_order_acquire) from another thread, it does order other operations: all writes to other variables (including non-atomic ones) done by the thread that did the store to a are visible to the thread that did the load, provided that load reads the value stored.
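
    As a minimal sketch of that pairing (the names ready, payload, producer and consumer are illustrative, not from the answer), a release store publishes a plain non-atomic write to a second thread without any explicit fence:

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // the flag the two threads synchronize on

    void producer() {
        payload = 42;                                  // plain write to non-atomic data
        ready.store(true, std::memory_order_release);  // release: orders the write above before this store
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // acquire: pairs with the release store
            ;                                          // spin until the flag is observed
        assert(payload == 42);                         // guaranteed once the acquire load sees true
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }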

    On modern processors, some memory orderings on operations are no-ops. e.g. on x86, memory_order_acquire, memory_order_consume and memory_order_release are implicit in the load and store instructions, and do not require separate fences. In these cases the orderings just affect the instruction reordering the compiler can do.
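
    For illustration only (the function names and the code-generation notes below are mine, describing typical compiler output rather than anything guaranteed), on x86 the weaker orderings usually cost nothing extra at the instruction level:

    #include <atomic>

    std::atomic<int> x{0};

    int  load_acquire()       { return x.load(std::memory_order_acquire); }  // typically a plain MOV on x86
    int  load_relaxed()       { return x.load(std::memory_order_relaxed); }  // same plain MOV; only compiler reordering differs
    void store_release(int v) { x.store(v, std::memory_order_release); }     // typically a plain MOV on x86
    void store_seq_cst(int v) { x.store(v, std::memory_order_seq_cst); }     // typically XCHG, or MOV plus MFENCE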

    Clarification: the implicit fences in the instructions mean that the compiler need not issue any explicit fence instructions when all the memory ordering constraints are attached to individual operations on atomic variables. If you instead use memory_order_relaxed for everything and add explicit fences, the compiler may well have to issue those fences as separate instructions.

    e.g. on x86, the XCHG instruction carries with it an implicit memory_order_seq_cst fence. There is thus no difference between the generated code for the two exchange operations below on x86 --- they both map to a single XCHG instruction:

    std::atomic<int> ai;
    ai.exchange(3, std::memory_order_relaxed);  // still a single XCHG on x86
    ai.exchange(3, std::memory_order_seq_cst);  // XCHG is already sequentially consistent
    

    However, I'm not yet aware of any compiler that gets rid of the explicit fence instructions in the following code:

    std::atomic_thread_fence(std::memory_order_seq_cst);  // emitted as an explicit fence (e.g. MFENCE)
    ai.exchange(3, std::memory_order_relaxed);             // a single XCHG, itself a full fence on x86
    std::atomic_thread_fence(std::memory_order_seq_cst);   // emitted as an explicit fence (e.g. MFENCE)
    

    I expect compilers will handle that optimization eventually, but there are other similar cases where the implicit fences will allow better optimization.

    Also, std::memory_order_consume can only be applied directly to operations on atomic variables.
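
    As a purely illustrative sketch (the Node, head and publish names are not from the answer), a consume ordering is attached to the load whose result carries the data dependency; note that mainstream compilers currently strengthen memory_order_consume to memory_order_acquire anyway:

    #include <atomic>

    struct Node { int value; };

    std::atomic<Node*> head{nullptr};

    void publish(Node* n) {
        head.store(n, std::memory_order_release);       // make *n visible before the pointer is published
    }

    int read_value() {
        Node* p = head.load(std::memory_order_consume); // orders only operations that depend on p
        return p ? p->value : -1;                       // p->value carries a data dependency from the load
    }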