Tags: operating-system, cpu-architecture, cpu-cache, tlb, cache-locality

Cache Locality - weight of TLB, Cache Lines, and ...?


From my understanding, the constructs which give rise to the high-level concept of "cache locality" are the following:

  1. Translation Lookaside Buffer (TLB) for virtual memory translation. Accessing the same virtual memory within the 4096-byte alignment (page size) will prevent the OS from needing to descend the hierarchical page table for translation.

  2. Cache lines mean that accessing the same virtual memory within a 64-byte alignment (cache line size) will avoid a fetch from RAM for the data or instruction.

I have a few questions:

  1. I have never once seen a quantitative estimate of the cost of a typical page-table descent. Is it actually significant as measured in clock cycles?

  2. I believe the 64 byte cache line refers to L1 cache lines - do L2 / L3 have different sizes? Under what circumstances is memory loaded into L2 / L3?

  3. Are there any additional constructs which give rise to "cache locality" aside from cache lines and the TLB?


Solution

  • 1. On most ISAs (including x86, which I suspect you're thinking of), hardware walks the page tables on a TLB miss, not the OS. The OS just puts the data structures in memory and gives the CPU the physical address of the top-level page directory; see What happens after a L2 TLB miss? Thus, the page walk can be done speculatively ahead of actually needing the TLB entry, ideally hiding most of the latency.

    The actual latency for a load that suffers a TLB miss (but an L1d hit for the data) would tell you something about page-walk latency, on whatever microarchitecture you're measuring. I don't have a number in mind for Skylake or anything; the practical cost would also depend on how much caching of higher levels of the page table is done inside the page-walk hardware. (So that's another source of locality: a page walk within the same 1 GiB as another recent page walk could be faster, even without using a 1G or 2M huge/large page so that one TLB entry can cover more address space.)
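
    If you want a rough number for your own machine, a minimal sketch (assuming a Linux-like environment, 4 KiB pages, and a C11 compiler) is to pointer-chase through either a handful of pages or through far more pages than the dTLB can cover. The gap in per-load latency mixes TLB-miss / page-walk cost with the data-cache misses that a 256 MiB working set inevitably brings, so treat the numbers as an upper bound rather than a clean page-walk measurement:

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define PAGE   4096
        #define NPAGES 65536                    /* 256 MiB: far beyond typical dTLB reach */
        #define ITERS  (1L << 24)

        /* Average ns per dependent load when chasing pointers placed on the
         * first cache line of `npages` randomly-ordered pages. */
        static double chase_ns(char *buf, size_t npages)
        {
            size_t *order = malloc(npages * sizeof *order);
            for (size_t i = 0; i < npages; i++) order[i] = i;
            for (size_t i = npages - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
                size_t j = (size_t)rand() % (i + 1);
                size_t t = order[i]; order[i] = order[j]; order[j] = t;
            }
            for (size_t i = 0; i < npages; i++)                /* link pages into one cycle */
                *(char **)(buf + order[i] * PAGE) = buf + order[(i + 1) % npages] * PAGE;

            char *p = buf + order[0] * PAGE;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < ITERS; i++)
                p = *(char * volatile *)p;                     /* serialized loads: latency-bound */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            free(order);
            return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
        }

        int main(void)
        {
            char *buf = aligned_alloc(PAGE, (size_t)NPAGES * PAGE);
            if (!buf) return 1;
            printf("8 pages   (TLB hits):   %.1f ns/load\n", chase_ns(buf, 8));
            printf("64K pages (TLB misses): %.1f ns/load\n", chase_ns(buf, NPAGES));
            free(buf);
            return 0;
        }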

    2. Some microarchitectures use larger lines for L2 or L3, but most don't; all modern x86 CPUs use 64-byte lines at every level. (But the L2 spatial prefetcher on Intel at least tries to complete a 128-byte-aligned pair of lines.) See Line size of L1 and L2 caches.
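
    If you'd rather query than assume, glibc on Linux reports the line size of each level via sysconf() (the _SC_LEVEL* names are a glibc extension; /sys/devices/system/cpu/cpu0/cache/index*/coherency_line_size shows the same thing):

        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            /* Each call returns the line size in bytes, or -1 / 0 if unknown. */
            printf("L1d line: %ld B\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
            printf("L2  line: %ld B\n", sysconf(_SC_LEVEL2_CACHE_LINESIZE));
            printf("L3  line: %ld B\n", sysconf(_SC_LEVEL3_CACHE_LINESIZE));
            return 0;
        }

    On current x86 hardware all three lines print 64.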

    Under what circumstances is memory loaded into L2 / L3?

    See also Is L2 line fill always triggered on lookup?

    That depends on the cache inclusion policy: e.g. an exclusive outer cache won't have a copy of something just loaded into L1d, and neither will a victim cache (wikipedia; although large L3 victim caches are not fully associative). In the x86 world, Intel hasn't normally used victim caches (Which cache mapping technique is used in intel core i7 processor?), but AMD has in some microarchitectures (e.g. Bulldozer-family). POWER has also used L3 victim caches.
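
    Whatever the policy, the usual way to see which level is actually serving your loads is the same pointer-chase idea as above, but sweeping the working-set size: average load latency steps up each time the set outgrows L1d, then L2, then L3. A sketch (capacities and the exact step positions are machine-dependent):

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define LINE 64

        /* Random pointer chase touching one 64-byte line per hop inside `bytes`. */
        static double chase_ns(char *buf, size_t bytes, long iters)
        {
            size_t n = bytes / LINE;
            size_t *idx = malloc(n * sizeof *idx);
            for (size_t i = 0; i < n; i++) idx[i] = i;
            for (size_t i = n - 1; i > 0; i--) {               /* shuffle to defeat prefetch */
                size_t j = (size_t)rand() % (i + 1);
                size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
            }
            for (size_t i = 0; i < n; i++)                     /* link lines into one cycle */
                *(char **)(buf + idx[i] * LINE) = buf + idx[(i + 1) % n] * LINE;

            char *p = buf;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < iters; i++)
                p = *(char * volatile *)p;                     /* dependent loads */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            free(idx);
            return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
        }

        int main(void)
        {
            size_t max = (size_t)256 << 20;                    /* 256 MiB */
            char *buf = aligned_alloc(LINE, max);
            if (!buf) return 1;
            for (size_t ws = 16 << 10; ws <= max; ws *= 2)     /* 16 KiB .. 256 MiB */
                printf("%8zu KiB: %6.1f ns/load\n", ws >> 10, chase_ns(buf, ws, 1L << 22));
            free(buf);
            return 0;
        }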

    Are there any additional constructs which give rise to "cache locality" aside from cache lines and the TLB?

    Yes: DRAM "pages" (rows) mean that multiple cache misses within one page let the DRAM controller just read another column from an already-open row, instead of selecting a different row. Changing rows adds latency beyond the normal DRAM access cost.

    What Every Programmer Should Know About Memory? covers DRAM, and lots of stuff about cache and optimizing for cache locality, and is still highly relevant.

    Also, as mentioned above, page walks for nearby pages may be somewhat faster.

    A large / huge page (e.g. 2 MiB on x86-64) lets one TLB entry cover the whole 2 MiB.
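
    On Linux you can request that explicitly with mmap(MAP_HUGETLB), or just hint that transparent hugepages are welcome with madvise(MADV_HUGEPAGE). A sketch (flag names are Linux-specific, and MAP_HUGETLB only succeeds if hugepages have been reserved, e.g. via /proc/sys/vm/nr_hugepages):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 64 << 20;                             /* 64 MiB, a multiple of 2 MiB */

            /* Try explicit hugetlb pages first: one TLB entry per 2 MiB. */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p == MAP_FAILED) {
                /* None reserved: fall back to 4 KiB pages and hint the kernel
                 * that transparent hugepages would be appreciated. */
                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) { perror("mmap"); return 1; }
                madvise(p, len, MADV_HUGEPAGE);
                puts("using 4 KiB pages + MADV_HUGEPAGE hint");
            } else {
                puts("using explicit hugetlb pages");
            }

            /* ... run the TLB-sensitive workload on p here ... */
            munmap(p, len);
            return 0;
        }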

    Sequential reads (or writes) of consecutive cache lines trigger HW prefetchers to pull those lines into L2 (and even L1d) ahead of demand access, reducing miss latency. (Or avoiding misses altogether if the loop takes long enough on ALU work that HW prefetch can keep up with its accesses.)
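
    A quick way to see the prefetchers at work (a sketch; the exact gap is machine-dependent, and the shuffled pass also pays extra TLB misses, so it isn't a pure prefetch measurement) is to sum the same array once in order and once in a shuffled order. The in-order pass streams at close to memory bandwidth because prefetch stays ahead of the loads; the shuffled pass is limited by per-miss latency:

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (32 * 1024 * 1024)          /* 32M ints = 128 MiB: far bigger than L3 */

        /* Sum a[] in the order given by order[] (or 0..n-1 if order is NULL),
         * returning the average ns per element.  *out defeats dead-code removal. */
        static double sum_ns(const int *a, const size_t *order, size_t n, long long *out)
        {
            struct timespec t0, t1;
            long long s = 0;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < n; i++)
                s += a[order ? order[i] : i];
            clock_gettime(CLOCK_MONOTONIC, &t1);
            *out = s;
            return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / n;
        }

        int main(void)
        {
            int    *a     = malloc((size_t)N * sizeof *a);
            size_t *order = malloc((size_t)N * sizeof *order);
            if (!a || !order) return 1;
            for (size_t i = 0; i < N; i++) { a[i] = (int)i; order[i] = i; }
            for (size_t i = N - 1; i > 0; i--) {               /* shuffle the index order */
                size_t j = (size_t)rand() % (i + 1);
                size_t t = order[i]; order[i] = order[j]; order[j] = t;
            }

            long long s1, s2;
            double seq = sum_ns(a, NULL,  N, &s1);             /* prefetch-friendly */
            double rnd = sum_ns(a, order, N, &s2);             /* defeats prefetchers */
            printf("sequential: %.2f ns/element, shuffled: %.2f ns/element (sums %lld %lld)\n",
                   seq, rnd, s1, s2);
            free(a); free(order);
            return 0;
        }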