If a CPU core uses a write buffer, then the load can bypass the most recent store to the referenced location from the write buffer, without waiting until it will appear in the cache. But, as it's written in A Primer on Memory Consistency and Coherence, if the CPU honors TSO memory model, then
... multithreading introduces a subtle write buffer issue for TSO. TSO write buffers are logically private to each thread context (virtual core). Thus, on a multithreaded core, one thread context should never bypass from the write buffer of another thread context. This logical separation can be implemented with per-thread-context write buffers or, more commonly, by using a shared write buffer with entries tagged by thread-context identifiers that permit bypassing only when tags match.
I can't grasp the necessity of this limitation. Could you please give me an example when allowing some thread to bypass a write buffer entry written by another thread on the same core leads to the violation of the TSO memory model?
The classic example of how TSO differs from sequential consistency (SC) is:
(This is example 2.4 here - http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf)
thread 0 | thread 1
---------------------------------
write 1-->[x] | write 1-->[y]
a = read [x] | b = read [y]
c = read [y] | d = read [x]
Both addresses store 0 initially. The question is: would c=d=0 be a valid outcome? We know a and b must forward the stores before them since they match the addresses of the local stores, and will probably be forwarded from the local threads store buffer. However, c and d may not be forwarded across context, so they may still show the old value.
The interesting gotcha here is that since each thread observes both stores, and forwards the local one, and outcome of a=1,c=0 would mean that t0 saw the store to [x] occurring first. An outcome of b=1,d=0 would mean that t1 saw the store to [y] occurring first. The fact that this is a possible outcome due to store buffer forwarding would break sequential consistency as it requires that all contexts agree on the same global order of stores. Instead, x86 settled for a weaker TSO model that allows this case.
Forwarding stores globally is practically impossible since buffered stores are not necessarily committed, which means they may even be in the wrong path of a branch misprediction. Forwarding locally is fine since a flush would also eliminate all the loads that forwarded from them, but on multiple contexts you don't have that. I've also seen work that tries to buffer stores globally outside of the core, but this is not very practical due to latency and bandwidth. For further reading, here's a recent paper that may be relevant - http://ieeexplore.ieee.org/abstract/document/7783736/