assembly x86 x86-64 cpu-architecture cpu-cache

Is L2 line fill always triggered on lookup?

It's a well-documented fact that L2 is non-inclusive with respect to L1d, meaning that L2 does not have to contain every line that L1d holds.

Can an L1d miss (read or RFO) that also misses in L2 fill the L1d line without filling the corresponding L2 line? Is there any explanation of that in Intel's manuals? Update: There is, in Intel's vol. 3, in the section about memory types.

Or, to rephrase the question: does a lookup that misses in L2 always cause the line to be filled into L2?

After some digging, I found the answer myself. It is a property of Write-back memory type, not a cache level:

Write-back (WB) — Writes and reads to and from system memory are cached. Reads come from cache lines on cache hits; read misses cause cache fills.

Solution

The answer depends on the cache inclusion policy of the outer caches. We can safely assume that read-allocate happens at every cache level unless the design says otherwise (an exclusive or victim cache).

On Intel, NT prefetch can bypass L2 (filling only L1d and a single way of L3, for example, on Intel CPUs with an inclusive L3), but normal demand loads, and software prefetches other than prefetchnta, are fetched through L2 and allocate in L2 as well as L1d.
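As a rough illustration of that difference, here is a toy Python model. It is only a sketch: the class and method names (NineHierarchy, demand_load, nt_prefetch) are made up for this example, it tracks line presence only, and it ignores capacity, eviction, associativity, and L3 entirely. It is not a description of how any real Intel core is implemented.

```python
class NineHierarchy:
    """Toy NINE L1d/L2 model (hypothetical): tracks presence only, no eviction."""

    def __init__(self):
        self.l1d = set()  # line addresses currently in L1d
        self.l2 = set()   # line addresses currently in L2

    def demand_load(self, line):
        """Normal load: a miss in both levels allocates in L2 and in L1d."""
        if line in self.l1d:
            return "L1d hit"
        if line in self.l2:
            self.l1d.add(line)  # L2 hit: copy into L1d, L2 keeps its copy
            return "L2 hit"
        self.l2.add(line)       # miss: the line is fetched through L2 ...
        self.l1d.add(line)      # ... and also allocated in L1d
        return "miss"

    def nt_prefetch(self, line):
        """NT-prefetch-like fill (assumed behavior): allocate in L1d, bypass L2."""
        self.l1d.add(line)


h = NineHierarchy()
h.demand_load(0x1000)  # demand miss
h.nt_prefetch(0x2000)  # NT-style fill
print(sorted(h.l1d), sorted(h.l2))
```

In this model, line 0x1000 ends up in both L1d and L2, while line 0x2000 is present only in L1d.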

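Demand loads allocating in L2 as well as L1d is what happens on most CPUs (NINE L2: not inclusive, not exclusive). But some microarchitectures have an L2 that is exclusive of L1d, so there the answer is no: the line is allocated only in L1d at first and moves to L2 later, when it is evicted from L1d. AMD has experimented more with exclusive caches than Intel.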

AMD has built some CPUs with exclusive and/or victim caches, e.g. Zen's per-CCX L3 is a victim cache for the L2 caches in that complex of 4 cores (https://en.wikichip.org/wiki/amd/microarchitectures/zen#Memory_Hierarchy, https://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/9). Skylake-X / Cascade Lake's non-inclusive L3 is also a victim cache for L2.

In those CPUs, reads don't allocate in L3, only L2 and L1d. (Or L1i for code fetches).

Barcelona (aka K10) has a shared L3, and L1 and L2 caches that are exclusive of each other (source: David Kanter's excellent writeup). So on K10, yes: a line newly allocated in L1d will definitely not be allocated in L2. The line evicted from L1d to make room for the new line will typically be moved to L2, evicting an older line from L2.
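The exclusive fill-and-evict flow described above can be sketched with a small Python model. This is an assumption-laden toy, not K10's actual design: the class name ExclusiveL1L2, the fully associative LRU replacement, and the tiny capacities are all invented for illustration.

```python
from collections import OrderedDict


class ExclusiveL1L2:
    """Toy exclusive L1d/L2 model (hypothetical): a line lives in at most one level."""

    def __init__(self, l1_lines=2, l2_lines=4):
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.l1 = OrderedDict()  # oldest entry first (LRU order)
        self.l2 = OrderedDict()

    def _fill_l1(self, line):
        """Allocate in L1d; the L1d victim (if any) moves to L2."""
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)   # LRU line leaves L1d
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)           # oldest L2 line is dropped
            self.l2[victim] = True                    # victim is now only in L2
        self.l1[line] = True

    def load(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)  # refresh LRU position
            return "L1d hit"
        if line in self.l2:
            del self.l2[line]          # remove from L2 to stay exclusive
            self._fill_l1(line)
            return "L2 hit"
        self._fill_l1(line)            # miss in both: allocate in L1d only
        return "miss"


c = ExclusiveL1L2()
for addr in (0, 1, 2):
    c.load(addr)
print(list(c.l1), list(c.l2))
```

With a 2-line L1d, the third load pushes line 0 out of L1d and into L2. A later load of line 0 hits in L2 and moves the line back into L1d, so the two levels never hold the same line at once.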

K8 had the same L2 exclusive of L1d, but no shared L3.

"It is a property of Write-back memory type, not a cache level ... read misses cause cache fills."

Intel's vol. 3 manual only gives abstract guarantees that are future-proof. That wording only guarantees that the line will be cached somewhere in the cache hierarchy.

In any sane design, that will include L1d, in anticipation of other reads of the same line (immediate spatial locality is very common). But the fill doesn't have to include L2 or even L3 right away, depending on the design; i.e. "cache fills" doesn't mean a fill at every level.

x86 doesn't guarantee anything on paper about having more than one level of cache. (Or even that there is a cache, except for the parts of the ISA docs about cache-as-RAM mode and stuff like that.) The docs are written assuming a CPU with at least 2 levels because that's been the case since P6 (and P5 with motherboards that provided an L2 cache), but anything like clflush should be read as "assuming there is a cache".