Tags: caching, cpu, cpu-architecture, prefetch

What are the implications of designing prefetchers at different cache levels?


I see a lot of papers about prefetching; some of them claim to target the L1 level, and some claim to target the LLC level.
Why is there such a restriction?

I know that the L1 data cache is much smaller than the LLC, but a prefetcher's metadata isn't stored in the cache itself; it lives in its own table structures, and those tables tend to be only a few KB to tens of KB, so the cache size difference seems like it should have little effect.

If a new prefetcher is designed, what factors affect where it should be placed?

Thanks!!!


Solution

  • I don't have a complete answer to all the tradeoffs, but I can point out some relevant factors.

    A prefetcher has to look at some sequence of requests to find patterns (e.g. L2 accesses or L2 misses), and it generates requests that fill some level of cache (and usually outer levels, if they're not exclusive); i.e. you have to put it somewhere. In theory you could have it look at L2 misses but generate prefetch loads that fill L1d, but I suspect that would tend to be worse, evicting useful data or wasting L2<->L1d bandwidth that's already fully used by a workload that misses a lot in L1d but mostly hits in L2.
    (Fun fact: Intel Xeons have an option to make the L2 prefetcher fill only L3 (LLC), not L2, despite still looking at L2 requests to decide what to prefetch.)
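
    As a concrete illustration of the pattern-detection side, here is a toy, single-stream stride detector in C++. It is only a sketch of the idea: real hardware tracks many streams in a small table, and the field names, confidence threshold, and "observe L2 misses" framing are my own illustrative choices, not any vendor's design.

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    // Toy model of a stride prefetcher watching a stream of miss addresses.
    // Real hardware tracks many independent streams; this tracks just one.
    struct StrideDetector {
        uint64_t last_addr   = 0;
        int64_t  last_stride = 0;
        int      confidence  = 0;   // bumped each time the same stride repeats

        // Called on each observed access (e.g. an L2 miss). Returns an address
        // to prefetch once the stride has repeated enough, or 0 for "no prefetch".
        uint64_t observe(uint64_t addr) {
            int64_t stride = (int64_t)(addr - last_addr);
            if (stride != 0 && stride == last_stride)
                confidence = std::min(confidence + 1, 3);
            else
                confidence = 0;
            last_stride = stride;
            last_addr   = addr;
            return (confidence >= 2) ? addr + stride : 0;  // one stride ahead
        }
    };

    int main() {
        StrideDetector d;
        for (uint64_t a = 0x1000; a < 0x1400; a += 0x80) {  // 128-byte stride
            if (uint64_t pf = d.observe(a))
                std::printf("access %#llx -> prefetch %#llx\n",
                            (unsigned long long)a, (unsigned long long)pf);
        }
    }
    ```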

    It's normal to have multiple different prefetcher units at different levels of cache; e.g. Intel CPUs have an L2 "streamer" (sequential or strided access within 4k regions, able to track multiple streams), and a "spatial" prefetcher that likes to complete 128-byte-aligned pairs of 64B lines.
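
    To make the "spatial" behaviour concrete: the candidate prefetch for a demanded 64-byte line is simply the other line of its 128-byte-aligned pair, i.e. the line address with bit 6 flipped. A minimal sketch (the function name is mine, not Intel's):

    ```cpp
    #include <cstdint>
    #include <cstdio>

    // For a demanded 64-byte line, the buddy line is the other half of the
    // 128-byte-aligned pair: align down to the line, then flip bit 6.
    uint64_t buddy_line(uint64_t addr) {
        uint64_t line = addr & ~uint64_t(63);
        return line ^ 64;
    }

    int main() {
        std::printf("%#llx -> buddy %#llx\n", 0x12340ull,
                    (unsigned long long)buddy_line(0x12340));
        std::printf("%#llx -> buddy %#llx\n", 0x12380ull,
                    (unsigned long long)buddy_line(0x12380));
    }
    ```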

    Intel CPUs also have an L1d prefetcher that's less aggressive but can see the actual load address and the program counter of the instruction (to detect when the same instruction loads a different address, like a loop over an array). By contrast, L2 only sees requests for whole lines coming from the L1 caches, so a prefetcher built into L2 can't tell the difference between a loop over an array and accessing two members of a large struct.
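
    For example, the two C++ access patterns below can generate similar sequences of adjacent line requests as seen by L2, but they look completely different to a prefetcher that can key on the load's program counter. The struct layout is a made-up illustration:

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical struct where fields 'a' and 'b' fall in different 64B lines.
    struct Big {
        uint64_t a;        // first cache line
        char     pad[64];  // pushes 'b' into the next line
        uint64_t b;        // second cache line
    };

    // One load instruction touching a new line each iteration: a PC-indexed
    // L1d prefetcher sees the same PC with a regular stride and can run ahead.
    uint64_t sum_array(const uint64_t* p, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; ++i)
            s += p[i * 8];             // one element per 64-byte line
        return s;
    }

    // Two different load instructions touching two adjacent lines once.
    // L2 just sees two neighbouring line requests either way, but the distinct
    // PCs tell the L1d prefetcher these are unrelated one-off loads.
    uint64_t read_struct(const Big& x) {
        return x.a + x.b;
    }

    int main() {
        static uint64_t buf[8 * 16] = {};
        Big x{1, {}, 2};
        std::printf("%llu %llu\n",
                    (unsigned long long)sum_array(buf, 16),
                    (unsigned long long)read_struct(x));
    }
    ```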

    L2 is larger than L1d, so L1d prefetch can more easily hurt by evicting useful data. But L2 is "close enough" on CPUs with 3 levels of cache and per-core private L2 caches: an L2 hit has only about 10 to 12 cycles of load-use latency on typical x86 microarchitectures aimed at high clock frequencies, for example. That's short enough for out-of-order exec to mostly hide, in many cases. So putting the smarter and more aggressive prefetcher at L2 gets the data close without as much risk of downsides. (Intel builds their L2 with a NINE inclusion policy wrt. L1d and L1i, so data can get evicted from L2 without leaving either L1. See Which cache mapping technique is used in intel core i7 processor?)
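
    A quick way to see why that latency is "short enough" is to compare a loop whose next address depends on the previous load with one whose addresses are all independent. This is only an illustrative sketch, not a benchmark:

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Dependent loads: each iteration needs the previous load's result, so the
    // full load-to-use latency (L1 hit, L2 hit, ...) is paid on every step.
    size_t pointer_chase(const size_t* next, size_t start, size_t steps) {
        size_t i = start;
        for (size_t s = 0; s < steps; ++s)
            i = next[i];               // serialized: latency-bound
        return i;
    }

    // Independent loads: no address depends on a prior load, so out-of-order
    // execution can keep many loads in flight and largely hide L2-hit latency.
    uint64_t strided_sum(const uint64_t* p, size_t n, size_t stride) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i += stride)
            s += p[i];                 // overlapped: throughput-bound
        return s;
    }

    int main() {
        static size_t   ring[1024];
        static uint64_t data[1 << 16];
        for (size_t i = 0; i < 1024; ++i)
            ring[i] = (i + 1) % 1024;  // a simple cycle to chase
        return (int)(pointer_chase(ring, 0, 4096) + strided_sum(data, 1 << 16, 8));
    }
    ```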

    L2 is a unified cache, so the same logic for code and data gets used. This seems somewhat reasonable, and means the prefetch logic doesn't have to get replicated for L1d and L1i. (Each of those can have their own simpler prefetchers that work well most of the time.)

    In Intel CPUs, there are only about 10 or 12 LFBs (Line Fill Buffers; 12 in Skylake) between the L1d cache and L2, so each L1d prefetch uses up an LFB to track its incoming cache line. This competes with demand loads, and with cache-miss stores.

    By contrast, the queue for requests from L2 going off-core (to L3 cache over the ring bus or mesh) has more entries, 16 if I recall correctly. It's called the "superqueue" in Intel CPUs. With increased parallelism for tracking in-flight cache-line transfers, there's more "room" for HW prefetch requests without hurting memory-level parallelism for demand loads. There might be a chicken/egg effect here, where the choice to have an L2 prefetch unit influenced the choice to have a wider superqueue. But the LFBs have to get snooped by loads (since cache-miss stores can commit to an LFB under limited conditions, as can WC stores) so having more of them would cost power for a larger CAM (content-addressable memory = hardware hash table).
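
    The connection between buffer count and sustainable bandwidth is just Little's law: bandwidth is roughly (bytes in flight) / (latency). A back-of-the-envelope sketch, where the 80 ns round-trip latency is an assumed, illustrative number:

    ```cpp
    #include <cstdio>
    #include <initializer_list>

    // Little's law: bandwidth ~= in-flight bytes / latency. More fill-buffer or
    // superqueue entries means more lines in flight, which is why extra HW
    // prefetch requests at L2 are less likely to starve demand loads.
    int main() {
        const double line_bytes = 64.0;
        const double latency_ns = 80.0;   // assumed round trip to DRAM

        for (int buffers : {10, 12, 16}) {
            double gb_per_s = buffers * line_bytes / latency_ns;  // bytes/ns == GB/s
            std::printf("%2d in-flight lines -> ~%.1f GB/s\n", buffers, gb_per_s);
        }
    }
    ```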


    Other microarchitectures may have different names for things, and might make some different choices (like only 2 levels of cache, or different inclusion policy). I chose Intel as an example because I already know the details there. (See In which condition DCU prefetcher start prefetching? for some details from Intel's optimization manual about the prefetchers that exist.)