multithreading concurrency x86 x86-64 memory-fences

If we write to remote memory over PCIe that is marked as WC (Write Combining), do we get any consistency automatically?

As we know, the x86 architecture provides acquire-release consistency automatically, i.e. all operations are ordered without any fences, except for a store followed by a load from a different location. (As Herb Sutter says on page 34: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c )

And as we know, when we write to remote WC-marked memory over the FSB, the CPU uses a temporary 64-byte buffer - the WCB (Write Combining Buffer)/BIU (Bus Interface Unit). And "When the WCBs are ultimately dumped to external memory over the FSB, data is not necessarily written to memory in the same order in which the earlier programmatic stores were executed.", i.e. we do not get acquire-release consistency automatically - quoted from "If we marked memory as WC(Write Combined), then do we have any consistency automatically?" See "WCB FSB Transactions" on page 1080 for more information.

But what happens if we write to remote WC-marked memory over PCI Express: do we get acquire-release consistency automatically when we use MOV or SSE?

Solution

There is no such thing as reordering across different contexts, since there is no original order across such writes (aside from anything explicitly maintained by synchronization methods). In other words, if core1 and core2 each write a line, these lines can be observed in any order without breaking consistency. What is prohibited is different cores observing different orders for these two lines (i.e. core3 sees the line from core1 first, and core4 sees the line from core2 first). Even that applies only to other cores: core1 and core2 may each see their own write ahead of the global order (this is a relaxation x86 makes relative to sequential consistency, to allow intra-core store forwarding).

What can be potentially reordered are stores within a given program context. Here the order does matter of course, so a program doing -

     thread 0     |   thread 1
 store [x] <-- 1  |   load [y] 
 store [y] <-- 1  |   load [x]

The normal x86 memory model (considered to be TSO-like) must guarantee that a result of x==0 and y==1 is impossible (assuming both were initially zero), since that would imply that the stores were reordered. To avoid that, stores are dispatched in the order maintained by the core's internal queues: even though execution is done out of order, a store may only be seen by the outside world after it is committed (the stage where the reorder buffer restores the original program order). This also guarantees that the store will not be seen if an earlier instruction had an unexpected exception or a branch misprediction.
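The two-thread example above can be sketched as a small C++ litmus test. This is only an illustration under assumptions: the function name `count_forbidden_outcomes` is made up for this sketch, and `std::atomic` with release/acquire ordering is used so that the compiler does not reorder the accesses and the program has no data race (on x86 these operations compile to plain MOV instructions). A passing run cannot prove the guarantee, it can only fail to find a counterexample.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the two-thread litmus test `iterations` times and
// returns how often the forbidden outcome (y == 1 but x == 0) was observed.
int count_forbidden_outcomes(int iterations) {
    assert(iterations >= 0);
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int seen_y = 0, seen_x = 0;

        std::thread writer([&] {
            x.store(1, std::memory_order_relaxed);  // store [x] <-- 1
            y.store(1, std::memory_order_release);  // store [y] <-- 1
        });
        std::thread reader([&] {
            seen_y = y.load(std::memory_order_acquire);  // load [y]
            seen_x = x.load(std::memory_order_relaxed);  // load [x]
        });
        writer.join();
        reader.join();

        // Seeing the second store without the first would mean the stores
        // became visible out of program order.
        if (seen_y == 1 && seen_x == 0) ++forbidden;
    }
    return forbidden;
}
```

Calling `count_forbidden_outcomes(1000)` should return 0 for ordinary write-back memory; the point of the rest of this answer is that the same reasoning does not carry over to WC memory.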

On the other hand, write-combining allows a more lenient memory ordering model, so stores may be combined and committed whenever the write-combining buffer holds the full line. This reduces the bandwidth consumed but allows stores to be reordered, e.g.

store [x] <-- ..
store [z] <-- ..
store [x+8] <-- ..
store [x+16] <-- ..
...

the 2nd store may be reordered ahead of the 1st, since the 1st will wait for the write-combining buffer to fill up. Once the buffer is full (although there's no enforced limit to that), the line is sent out to memory, regardless of any path it has to travel.
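If the order does matter, the usual remedy is an explicit store fence between the weakly ordered stores and whatever store announces them. The following is a minimal sketch under assumptions, not a driver-grade implementation: `publish_with_fence` and the `ready` flag are hypothetical names, and non-temporal stores to ordinary memory stand in for stores to a real WC-mapped PCIe region (mapping such a region is platform-specific and not shown). The non-x86 branch exists only so the sketch compiles elsewhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32, _mm_sfence
#define HAVE_SSE2_STREAM 1
#else
#define HAVE_SSE2_STREAM 0
#endif

// Hypothetical helper: copies `count` ints to `dst` using weakly ordered
// stores, then sets `ready` so a consumer knows the data is complete.
void publish_with_fence(int* dst, const int* src, std::size_t count,
                        std::atomic<int>& ready) {
    assert(dst != nullptr && src != nullptr);
#if HAVE_SSE2_STREAM
    for (std::size_t i = 0; i < count; ++i) {
        _mm_stream_si32(dst + i, src[i]);  // non-temporal (WC-style) store
    }
    // Without this fence the flag store below could become visible before
    // some of the data stores above.
    _mm_sfence();
#else
    // Assumption: portable fallback with plain stores and a release fence.
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
    std::atomic_thread_fence(std::memory_order_release);
#endif
    ready.store(1, std::memory_order_release);
}
```

A consumer would wait for `ready` to become 1 (with an acquire load) before reading `dst`. The fence is placed once, after the whole batch, so the stores inside the batch can still be combined freely.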

The comment about the FSB in that other answer doesn't mean the behavior is FSB-specific. It dates back to a Pentium 4 guide, so past the last-level cache they simply assume you go out on the FSB. The terms are different nowadays, but in any case nobody outside the core cares about the ordering of lines and, as I said, once you're no longer within the core there's no notion of order, only coherency. They just meant that once the line is out it may be observed, and that's the point where the broken order becomes visible.