When does a CPU flush a value from the store buffer to the L1 cache?

Core A writes a value x to its store buffer, waits for the invalid ack, and then flushes x to the cache. Does it wait for only one ack, or for all of them? And how does it know how many acks to expect from all the CPUs?

Solution

It isn't clear to me what you mean by "invalid ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line.

In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores, since stores in the store buffer are not yet globally visible. Stores only become globally visible when they commit to L1 at some point after they have retired. At this point¹ the cache controller will make an RFO (request for ownership) for the associated line if it isn't already held in the cache in an exclusive state. It is essentially at this point that the store becomes globally visible. The L1 cache controller doesn't need to know how many other invalidations are in flight, because they are mediated by higher-level components in the system as part of the MESI protocol, and when the controller gets the line in the E state, it is guaranteed to be the exclusive owner.
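To make the visibility point concrete, here is a toy Python model. It is only a sketch under heavy assumptions, not how any real core works: the names `Directory`, `Core`, `rfo`, `observe` and `commit_oldest` are invented for illustration, and the "directory" stands in for whatever higher-level agents mediate coherence. It shows that a store sitting in the store buffer is invisible to everyone else, and becomes visible once it commits to L1 after a single RFO round trip.

```python
from collections import deque


class Directory:
    """Hypothetical stand-in for the higher-level coherence agents."""

    def __init__(self):
        self.owner = {}   # line -> core holding the line exclusively
        self.memory = {}  # line -> value last written back

    def rfo(self, core, line):
        # Take the line from the previous owner (if any), writing its data back.
        prev = self.owner.get(line)
        if prev is not None and prev is not core:
            self.memory[line] = prev.l1.pop(line)
        self.owner[line] = core
        return self.memory.get(line, 0)  # single reply: data plus ownership

    def observe(self, line):
        # Value another core would see: the owner's L1 copy, else memory.
        owner = self.owner.get(line)
        if owner is not None and line in owner.l1:
            return owner.l1[line]
        return self.memory.get(line, 0)


class Core:
    def __init__(self, directory):
        self.directory = directory
        self.store_buffer = deque()  # retired stores, oldest first
        self.l1 = {}                 # exclusively owned lines: line -> value

    def store(self, line, value):
        # Buffer only; nothing is globally visible yet.
        self.store_buffer.append((line, value))

    def commit_oldest(self):
        line, value = self.store_buffer.popleft()
        if line not in self.l1:
            # One request, one reply; the core never counts individual acks.
            self.l1[line] = self.directory.rfo(self, line)
        self.l1[line] = value  # from here on the store is globally visible


directory = Directory()
core_a, core_b = Core(directory), Core(directory)
core_a.store(0x40, 1)
before = directory.observe(0x40)  # store still buffered
core_a.commit_oldest()
after = directory.observe(0x40)   # store committed to L1
print(before, after)  # prints "0 1"
```

Note that the model ignores store-to-load forwarding within the writing core, and treats E and M as one "exclusively owned" state.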

In short, invalidations from other cores have little effect on stores in the store buffer², since they become globally visible at a single point based on an RFO request. It is loads that have already executed that are more likely to be affected by invalidation activity on another core, especially on strongly ordered platforms such as x86, which doesn't allow visible load-load reordering. The so-called MOB (memory order buffer) on x86, for example, is responsible for tracking whether invalidations potentially break the ordering rules.
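The load side can be sketched in the same spirit. The following is a deliberately conservative toy, not a description of the actual MOB: the `LoadQueue` class and its methods are hypothetical, and it simply assumes that any load which has executed but not yet retired must be redone if its line is invalidated in the meantime.

```python
class LoadQueue:
    """Toy tracker for loads that executed early (hypothetical design)."""

    def __init__(self):
        self.entries = []  # in program order, oldest first

    def allocate(self, line):
        entry = {"line": line, "executed": False, "replay": False}
        self.entries.append(entry)
        return entry

    def execute(self, entry):
        # The load reads its value, possibly ahead of older loads.
        entry["executed"] = True
        entry["replay"] = False

    def on_invalidate(self, line):
        # Conservative rule: a value read before the invalidation may now
        # violate load-load ordering, so the load has to be redone.
        for entry in self.entries:
            if entry["line"] == line and entry["executed"]:
                entry["executed"] = False
                entry["replay"] = True

    def retire_oldest(self):
        # A load can only retire once it holds a still-valid value.
        if not self.entries or not self.entries[0]["executed"]:
            return None
        return self.entries.pop(0)


lq = LoadQueue()
older = lq.allocate(0x40)
younger = lq.allocate(0x80)
lq.execute(younger)      # younger load executes out of order
lq.on_invalidate(0x80)   # another core takes ownership of that line
print(younger["replay"], older["replay"])  # prints "True False"
```

Nothing comparable is needed for the store buffer in this model, which is the asymmetry described above.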

RFO Response

Perhaps the "acks" you were talking about are the responses from other cores to the writing core's request to obtain or upgrade its ownership of the line so that it can write to it: i.e., invalidating copies of the line in the other CPUs and so on.

This is commonly known as issuing an RFO, which, when successful, leaves the line in the E state in the requesting core.

Most CPUs are layered, with a variety of different agents working together to ensure coherency. In practice, this means that a core doesn't need to wait for up to N-1 "acks" from the other N-1 cores on an N-core system, but rather for just a single reply from a higher-level component, which itself is in charge of sending requests to, and collecting responses from, the other cores.

One example could be a single-socket multi-core CPU with a private L1 and L2 per core, and a shared L3. A core might send its RFO down to the L3, which might send invalidate requests to all cores, wait for their responses and then acknowledge the RFO request to the requesting core. Alternatively, the L3 may store some bits which indicate which cores could possibly have a copy of the line, and then it only needs to send the requests to those cores (the role the L3 is taking in that case is sometimes referred to as a snoop filter).
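The snoop filter idea can be sketched as follows. This is an illustrative toy rather than any vendor's design: `Core`, `L3SnoopFilter`, `read_shared` and `rfo` are made-up names, and the sharer set plays the role of the per-line presence bits. The point is that the requesting core sees exactly one reply, while the L3 decides whom to invalidate and collects the acks itself.

```python
class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.lines = set()  # lines present in this core's private caches

    def invalidate(self, line):
        self.lines.discard(line)
        return True  # the "ack" sent back to the L3


class L3SnoopFilter:
    """Shared L3 acting as a snoop filter (hypothetical sketch)."""

    def __init__(self, cores):
        self.cores = cores
        self.sharers = {}            # line -> ids of cores that may hold a copy
        self.invalidations_sent = 0

    def read_shared(self, core, line):
        # Record that this core may now hold a copy of the line.
        self.sharers.setdefault(line, set()).add(core.core_id)
        core.lines.add(line)

    def rfo(self, requester, line):
        # Only cores flagged as possible sharers are snooped.
        targets = self.sharers.get(line, set()) - {requester.core_id}
        acks = 0
        for core_id in sorted(targets):
            self.invalidations_sent += 1
            if self.cores[core_id].invalidate(line):
                acks += 1
        if acks != len(targets):
            return False  # would have to wait or retry in a real system
        self.sharers[line] = {requester.core_id}
        requester.lines.add(line)
        return True  # the single reply the requesting core waits for


cores = [Core(i) for i in range(4)]
l3 = L3SnoopFilter(cores)
l3.read_shared(cores[1], 0x40)
l3.read_shared(cores[2], 0x40)
granted = l3.rfo(cores[0], 0x40)
print(granted, l3.invalidations_sent)  # prints "True 2"
```

Core 3 never held the line, so it is never disturbed; without the sharer bits the L3 would have had to snoop all three other cores.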

Since all communication between agents passes through the L3, it is able to keep everything consistent. In the case of a multi-socket system, things get more complicated: the L3 on the local socket may again get the request and may pass it over to the other socket to do the same type of invalidation there. Again, there might be a snoop filter, or other mechanisms may exist, and the behavior may even be configurable!

For example, in Intel's Broadwell Xeon architecture, there are four different configurable snoop modes:

Broadwell offers four different snoop modes a reintroduction of Home Snoop with Directory and Opportunistic Snoop Broadcast (HS with DIR + OSB) previously available on Ivy Bridge, and three snoop modes that were available on Haswell, Early Snoop, Home Snoop, and Cluster on Die Mode (COD). Table 5 maps the memory bandwidth and latency trade-offs that will vary across each of the different modes. Most workloads will find that Home Snoop with Directory and Opportunistic Snoop Broadcast will be the best choice.

... with different performance tradeoffs, which Table 5 of that document maps out.

The rest of that document goes into some detail about how the various modes work.

So I guess the short answer is "it's complicated and depends on the detailed design and possibly even user-configurable settings".

¹ Or potentially at some earlier point since an optimized implementation might "look ahead" in the store buffer and issue RFOs (so-called "RFO prefetches") for upcoming stores even before they become the most senior store.

² Invalidations may, however, complicate the RFO prefetches mentioned in the first footnote, since it means there is a window where the line can be "stolen back" by another core, making the RFO prefetch wasted work. A sophisticated implementation may have a predictor that varies the RFO prefetch aggressiveness based on monitoring whether this occurs.
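As a purely speculative sketch of what such a predictor might look like, here is a small saturating counter in Python. Everything here is an assumption for illustration: the `RfoPrefetchThrottle` name, the idea of expressing aggressiveness as a look-ahead depth into the store buffer, and the 0 to 3 range are all invented, and no real CPU is documented to work this way.

```python
class RfoPrefetchThrottle:
    """Saturating counter controlling RFO prefetch depth (hypothetical)."""

    def __init__(self, max_level=3):
        self.max_level = max_level
        self.level = max_level  # start fully aggressive

    def record(self, stolen_back):
        # stolen_back: the prefetched line was lost before the store committed.
        if stolen_back:
            self.level = max(0, self.level - 1)
        else:
            self.level = min(self.max_level, self.level + 1)

    def lookahead(self):
        # How many store-buffer entries past the oldest one to prefetch for.
        return self.level


throttle = RfoPrefetchThrottle()
throttle.record(stolen_back=True)
throttle.record(stolen_back=True)
throttle.record(stolen_back=False)
print(throttle.lookahead())  # prints "2"
```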