caching architecture system memory-model

Memory models-/Cache coherence protocols: How TSO goes together with MESIF

Having just worked through my system's programming lecture material, I stumbled upon the crucial concepts of memory models as well as cache coherence protocols. Although they make sense as independent concepts, it is not really clear how they go together. Specifically, when looking at x86, I am working with an ISA enforcing the TSO memory model, and a CPU (in the case of Intel) using the MESIF cache coherence protocol.

In the beginning, the professor introduced cache coherence protocols as means of ensuring that to any core in the chip, it appears as if they all access one large, monolithic block of memory. Then, after wrapping up with cache coherence, he continued with memory models, specifically TSO (we were introduced to linearizability-/sequential consistency in our parallel programming class already). The following is a direct quote from the lecture material about the x86 memory model:

Standard for 64-bit x86 processors

Sometimes called Total Store Ordering (TSO)

Earlier 32-bit x86 implemented PRAM – weaker!

Write-to-read relaxation: later reads can bypass earlier writes

All processors see writes from one processor in the order they were issued.

Processors can see different interleavings of writes from different processors.

It seems as if we "solved" the problem of slow sequential consistency by introducing what is (yet another) layer in the cache hierarchy, namely the (ordered) store buffer. To me, TSO seems orthogonal to the principles of cache coherence. We worked so hard to get our caches to match, only to add another layer in between not covered by cache coherence.

Questions:

Why are the store buffers not covered by cache coherence protocols? Is it assumed that writeback to L1 from store buffers is so fast, that inconsistencies due to intermediate writebacks would not be an issue in the majority of cases? (i.e. I reckon store buffer -> L1 transfer only takes a few cycles, so as soon as L1 receives the data, it sends a transaction across the bus, telling other cores to invalidate their copy)
How should I think of the two concepts of cache coherence as well as memory model? The way I understand it, memory models are the theoretical concepts of what we want, and cache coherence is part of the practical implementation to achieve said model.

Thank you so much in advance for the clarifications!

Best, Felix

Solution

The sequential consistency model is the most commonly prescribed memory model for shared memory parallel programming. A parallel-program comprising multiple tasks or threads, sequential consistency requires two conditions as described below.

Program order execution: All memory operations in each task appear to execute in that task's program order.
Memory access atomicity: Memory operations (in all tasks of the parallel program) appear to execute one-at-a-time.

Every programmer assumes these conditions to reason about their parallel programs. Unfortunately, sequential consistency is a less useful model than imagined. The main reason is the implementation cost of these two properties. Enforcing these properties prohibit many basic compiler and hardware optimizations[1]. Other weak/relaxed memory models are proposed that relax these properties and allow compiler and hardware optimizations. These weak memory models trade programmability for performance.

Why are the store buffers not covered by cache coherence protocols?

It is a design choice of TSO for performance reasons. Serving a load from the store buffer or serving a load when its preceding store (of different address) is still in the store buffer, reduces the store latency. To keep store buffers coherent, the load has to wait until all other processors have acknowledged receipt of the invalidates generated by the store. Moreover, most of the time, there may not be any copies of the store address in other caches (the variable is local to a task), then waiting for the acknowledges is a waste of time. In case other tasks share this variable, then waiting for the acknowledges can be explicitly enforced by using atomic or fence instructions.

How should I think of the two concepts of cache coherence as well as memory model?

Cache coherence protocols are concerned with serializing stores to the same memory location and ensuring that a load returns the value of the most recent store to the same memory location. Cache coherence protocols are required only when there are caches or multiple copies of the same memory location, and its job is to keep all the copies coherent.

A memory consistency model is concerned with the relative order of loads and stores (of the same task) to different memory locations. Any system that involves executing shared-memory parallel programs (multiple tasks or threads communicating through a shared memory) must define its memory consistency model.

Broadly, cache coherence protocols implement a part of the memory consistency model. More precisely, it is the combination of core pipeline, and cache coherence protocols (and every other component that a memory instruction traverses) must adhere to the memory model specifications. [1]: Shared memory consistency models: a tutorial