Search code examples
performancecpu-architecturevirtual-memorymmupage-tables

Does page walk take advantage of shared tables?


Suppose two address spaces share a largish lump of non-contiguous memory. The system might want to share physical page table(s) between them. These tables wouldn't use Global bits (even if supported), and would tie them to asids if supported.

There are immediate benefits since the data cache will be less polluted than by a copy, less pinned ram, etc.

Does the page walk take explicit advantage of this in any known architecture? If so, does that imply the mmu is explicitly caching & sharing interior page tree nodes based on physical tag?

Sorry for the multiple questions; it really is one broken down. I am trying to determine if it is worth devising a measurement test for this.


Solution

  • On modern x86 CPUs (like Sandybridge-family), page walks fetch through the cache hierarchy (L1d / L2 / L3), so yes there's an obvious benefit there for having to different page directories point to the same subtree for a shared region of virtual address space. Or for some AMD, fetch through L2, skipping L1d.

    What happens after a L2 TLB miss? has more details about the fact that page-walk definitely fetches through cache, e.g. Broadwell perf counters exist to measure hits.

    ("The MMU" is part of a CPU core; the L1dTLB is tightly coupled to load/store execution units. The page walker is a fairly separate thing, though, and runs in parallel with instruction execution, but is still part of the core and can be triggered speculatively, etc. So it's tightly coupled enough to access memory through L1d cache.)


    Higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. Section 3 of this paper confirms that Intel and AMD actually do this in practice, so you need to flush the TLB in cases where you might think you didn't need to.

    However, I don't think you'll find that PDE caching happening across a change in the top-level page-tables.

    On x86, you install a new page table with a mov to CR3; that implicitly flushes all cached translations and internal page-walker PDE caching, like invlpg does for one virtual address. (Or with ASIDs, makes TLB entries from different ASIDs unavailable for hits).

    The main issue is that TLB the and page-walker internal caches are not coherent with main memory / data caches. I think all ISAs that do HW page walks at all require manual flushing of TLBs, with semantics like x86 for installing a new page table. (Some ISAs like MIPS only do software TLB management, invoking a special kernel TLB-miss handler; your question won't apply there.)

    So yes, they could detect same physical address, but for sanity you also have to avoid using stale cached data from after a store to that physical address.

    Without hardware-managed coherence between page-table stores and TLB/pagewalk, there's no way this cache could happen safely.

    That said; some x86 CPUs do go beyond what's on paper and do limited coherency with stores, but only protecting you from speculative page walks for backwards compat with OSes that assumed a valid but not-yet-used PTE could be modified without invlpg. http://blog.stuffedcow.net/2015/08/pagewalk-coherence/

    So it's not unheard of for microarchitectures to snoop stores to detect stores to certain ranges; you could plausibly have stores snoop the address ranges near locations the page-walker had internally cached, effectively providing coherence for internal page-walker caches.

    Modern x86 does in practice detect self-modifying code by snoop for stores near any in-flight instructions. Observing stale instruction fetching on x86 with self-modifying code In that case snoop hits are handled by nuking the whole back-end state back to retirement state.

    So it's plausible that you could in theory design a CPU with an efficient mechanism to be able to take advantage of this transparently, but it has significant cost (snooping every store against a CAM to check for matches on page-walker-cached addresses) for very low benefit. Unless I'm missing something, I don't think there's an easier way to do this, so I'd bet money that no real designs actually do this.

    Hard to imagine outside of x86; almost everything else takes a "weaker" / "fewer guarantees" approach and would only snoop the store buffer (for store-forwarding). CAMs (content-addressable-memory = hardware hash table) are power-hungry, and handling the special case of a hit would complicate the pipeline. Especially an OoO exec pipeline where the store to a PTE might not have its store-address ready until after a load wanted to use that TLB entry. Introducing more pipeline nukes is a bad thing.


    The benefit of this would be tiny

    After the first page-walk fetches data from L1d cache (or farther away if it wasn't hot in L1d either), then the usual cache-within-page-walker mechanisms can act normally.

    So further page walks for nearby pages before the next context switch can benefit from page-walker internal caches. This has benefits, and is what some real HW does (at least some x86; IDK about others).

    All the argument above about why this would require snooping for coherent page tables is about having the page-walker internal caches stay hot across a context switch.

    L1d can easily do that; VIPT caches that behave like PIPT (no aliasing) simply cache based on physical address and don't need flushing on context switch.

    If you're context-switching very frequently, the ASIDs let TLB entries proper stay cached. If you're still getting a lot of TLB misses, the worst case is that they have to fetch through cache all the way from the top. This is really not bad and very much not worth spending a lot of transistors and power budget on.


    I'm only considering OS on bare metal, not HW virtualization with nested page tables. (Hypervisor virtualizing the guest OS's page tables). I think all the same arguments basically apply, though. Page walk still definitely fetches through cache.