c#.net performance volatile memory-model

Memory Model: preventing store-release and load-acquire reordering

It is known that, unlike Java's volatiles, .NET's ones allow reordering of volatile writes with the following volatile reads from another location. When it is a problem MemoryBarier is recommended to be placed between them, or Interlocked.Exchange can be used instead of volatile write.

It works but MemoryBarier could be a performance killer when used in highly optimized lock-free code.

I thought about it a bit and came with an idea. I want somebody to tell me if I took the right way.

So, the idea is the following:

We want to prevent reordering between these two accesses:

 volatile1 write

 volatile2 read

From .NET MM we know that :

 1) writes to a variable cannot be reordered with  a  following read from 
    the same variable
 2) no volatile accesses can be eliminated
 3) no memory accesses can be reordered with a previous volatile read

To prevent unwanted reordering between write and read we introduce a dummy volatile read from the variable we've just written to:

 A) volatile1 write
 B) volatile1 read [to a visible (accessible | potentially shared) location]
 C) volatile2 read

In such case B cannot be reordered with A as they both access the same variable, C cannot be reordered with B because two volatile reads cannot be reordered with each other, and transitively C cannot be reordered with A.

And the question:

Am I right? Can that dummy volatile read be used as a lightweight memory barrier for such scenario?

Solution

I forgot to post the soon found answer back to SO. Better late than never..

Turns out it is impossible thanks to how processors (at least x86-x64 kind of them) optimize memory accesses. I found the answer when was reading Intel manuals on its procs. Example 8-5:" Intra-Processor Forwarding is Allowed" was looking suspicious. Googling for "store buffer forwarding" lead to Joe Duffy's blog posts (first and second - read them pls).

To optimize writes processor uses store buffers (per processor queues of write ops). Buffering writes locally allows it to do next optimization: satisfying reads from the previously buffered writes to the same memory location and which haven't left the processor yet. The technique is called store-buffer forwarding (or store-to-load forwarding).

The end result in our case is that as reading at B is satisfied from a local storage (store buffer) it is not considered a volatile read and can be reordered with further volatile reads from another memory location (C).

It seems like a violation of the rule "Volatile reads don't reorder with each other". Yes, it is a violation, but very rare and exotic one. Why did it happen? Probably because Intel's released its first formal document on memory model of its processors years after .NET (and its JIT compiler) saw the sunlight.

So the answer is: no, the dummy reading (B) doesn't prevent reordering between A and C and cannot be used as a lightweight memory barrier.