I had always assumed that when linear-address translation hits a TLB miss, the processor walks the page-directory structure in memory. However, Intel Manual Vol. 3, Section 4.10.3 defines the so-called Paging-Structure Caches, which I had not heard about before.
This is what is done on a TLB miss:
If the processor does not find a relevant TLB entry or PDE-cache entry, it may use the upper bits of the linear address (for 4-level paging, bits 47:30; for 5-level paging, bits 56:30) to select an entry from the PDPTE cache that is associated with the current PCID. It can then use that entry to complete the translation process (locating a PDE, etc.) as if it had traversed the PDPTE, the PML4E, and (for 5-level paging) the PML5E corresponding to the PDPTE-cache entry.
and
If the processor does not find a relevant TLB entry, PDE-cache entry, or PDPTE-cache entry, it may use the upper bits of the linear address (for 4-level paging, bits 47:39; for 5-level paging, bits 56:39) to select an entry from the PML4E cache that is associated with the current PCID. It can then use that entry to complete the translation process (locating a PDPTE, etc.) as if it had traversed the corresponding PML4E.
So a TLB miss does not necessarily mean traversing the whole paging structure.
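To make the quoted bit ranges concrete, here is a minimal C sketch (my own illustration, not from the manual) of how the upper bits of a 4-level linear address would tag those caches; the PDE-cache range (bits 47:21) comes from the same manual section even though it isn't quoted above:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration: the linear-address bit fields that tag the
 * paging-structure caches under 4-level paging.
 *   bits 47:39 -> PML4E-cache tag  (9 bits)
 *   bits 47:30 -> PDPTE-cache tag (18 bits)
 *   bits 47:21 -> PDE-cache tag   (27 bits)
 */
int main(void) {
    uint64_t lin = 0x00007f1234567890ull;  /* an arbitrary canonical user address */

    uint64_t pml4e_tag = (lin >> 39) & 0x1ffull;      /* bits 47:39 */
    uint64_t pdpte_tag = (lin >> 30) & 0x3ffffull;    /* bits 47:30 */
    uint64_t pde_tag   = (lin >> 21) & 0x7ffffffull;  /* bits 47:21 */

    printf("PML4E-cache tag: %#llx\n", (unsigned long long)pml4e_tag);
    printf("PDPTE-cache tag: %#llx\n", (unsigned long long)pdpte_tag);
    printf("PDE-cache tag:   %#llx\n", (unsigned long long)pde_tag);
    return 0;
}
```

A PDE-cache hit skips the PML4E, PDPTE, and PDE memory accesses of the walk and leaves only the PTE fetch; a PDPTE-cache hit skips two accesses; a PML4E-cache hit skips one.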
Could you give some examples of perf events that describe paging-structure cache accesses, and how to optimize for paging-structure cache usage?
AFAIK, Skylake doesn't have any perf events for the details of page walks. There are counters for the number of walks completed and the number of cycles with walks active, so I guess you could try to compute the average duration of each walk.
(There are two PMHs (page-miss handlers, i.e. hardware page walkers) in Skylake and later; dtlb_load_misses.walk_pending counts 1 or 2 per cycle depending on how many are active, or 0 if neither is. But it might only count demand-load TLB misses, not next-page TLB prefetch. There are similar events for stores and code fetch. Other events like dtlb_load_misses.walk_active count cycles when one or both page walkers are active.)
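As a sketch of the averaging idea (placeholder numbers, my own example; event availability and exact semantics vary by microarchitecture), divide total walker-busy cycles by the number of walks completed:

```c
#include <stdio.h>

/* Sketch: estimate the average demand-load page-walk duration from two
 * counter totals such as
 *   dtlb_load_misses.walk_pending   (walker-cycles: 1 or 2 per cycle)
 *   dtlb_load_misses.walk_completed (number of walks that finished)
 * The values below are made-up placeholders; substitute totals you
 * collected with perf or whatever PMU interface you use. */
int main(void) {
    unsigned long long walk_pending_cycles = 1200000;  /* placeholder */
    unsigned long long walks_completed     = 25000;    /* placeholder */

    if (walks_completed)
        printf("~%.1f cycles per demand-load page walk on average\n",
               (double)walk_pending_cycles / (double)walks_completed);
    return 0;
}
```

Using the per-walker walk_pending total (rather than walk_active) means overlapping walks each contribute their own cycles, so the quotient approximates the mean latency of an individual walk.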
The main way to take advantage of the page walkers caching higher levels of the page table (and/or L2 / L1d also caching those physical locations) is to have locality on a larger scale: keep the hot pages in your working set within the same aligned 2M or 1G regions, so they all share a common upper part of the radix tree of page tables.
Or within a few such groups; you don't need to try to get malloc / mmap to allocate next to your code or the stack. That's mostly up to your OS, unless you do one big allocation and carve it up yourself.
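If you do go the carve-it-up route, a rough C sketch of the idea (sizes, names, and layout are arbitrary examples, not from the answer above): reserve one large anonymous mapping and place the hot structures next to each other inside it, so they stay within a handful of aligned 2M regions and at most a couple of 1G regions instead of being scattered across many unrelated mappings.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Sketch: one big reservation, carved up by hand.  Everything placed
 * here is contiguous in virtual address space, so the hot structures
 * share upper-level page-table entries (and PDE/PDPTE-cache entries)
 * far more than separately malloc'ed/mmap'ed blocks typically would. */

#define REGION_SIZE (1ull << 30)              /* 1 GiB of address space */

struct hot_table { long entries[1 << 20]; };  /* 8 MiB */
struct hot_index { int  slots[1 << 18]; };    /* 1 MiB */

int main(void) {
    unsigned char *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Carve the region up by hand; both structures land in the first
     * ~9 MiB of the mapping, i.e. a handful of 2 MiB slices. */
    struct hot_table *table = (struct hot_table *)base;
    struct hot_index *index = (struct hot_index *)(base + sizeof *table);

    table->entries[0] = 42;
    index->slots[0]   = 7;
    printf("region %p  table %p  index %p\n",
           (void *)base, (void *)table, (void *)index);

    munmap(base, REGION_SIZE);
    return 0;
}
```

(The mapping itself isn't guaranteed to start on a 2M or 1G boundary, so transparent hugepages or an aligned reservation would be needed to get the full benefit, but contiguity alone already concentrates the upper levels of the translation tree.)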
Static code/data (at least in a non-PIE Linux executable) starts at absolute address 4 MiB by default, which is at the start of a 2M largepage, near the start of a 1G hugepage, and very near the start of the 512 GiB (2^9 G) region above that, i.e. the range covered by a single PML4 entry. So even if you have a lot of code + data, it's well-positioned. I assume ASLR for non-PIE executables is more granular, but static code + data is usually pretty small compared to even a 1G region.