assembly x86 cpu-architecture cpu-cache tlb

Does INVLPG instruction or mprotect() affect the CPU cache state while invalidating TLB entries?

I am working on some code that involves L1/2/3 cache eviction & TLB entry invalidation. I'm trying to use the INVLPG instruction to invalidate TLB entries and verify some results achieved by mprotect(), but I'm doubtful about both INVLPG and mprotect()'s effects on the cache hierarchy as that is of importance to me.

Specifically: when INVLPG invalidates a TLB entry (or mprotect() downgrades permissions, which has a similar effect on the TLB entry) for the given address(es), does either cause any side effects on the CPU cache for those address(es)? Does it potentially cause a cache entry to be created, or affected?

I couldn't find specific details about their cache interaction.

Solution

On mainstream x86 CPUs for at least the last couple of decades, all caches, including L1i and L1d, behave as physically addressed. (The L1 caches are VIPT, i.e. virtually indexed and physically tagged, but designed so that aliasing can't happen. See: Which cache mapping technique is used in intel core i7 processor?) So they don't need to be invalidated when a virtual page no longer maps to the same physical page; any valid lines just keep caching their part of whatever physical page they came from.

Some microarchitectures had larger less-associative L1 caches (like Zen 1's 64K 4-way L1i and K10/Bulldozer's 64K 2-way L1i) that need some tricks to avoid aliasing problems. I think they probably still work as if they're purely physically addressed, in terms of not needing invalidation on invlpg. Some past discussion of such caches includes:

How is AMD's micro-tagged L1 data cache accessed?
Performance implications of aliasing in VIPT cache
https://www.phoronix.com/review/amd_bulldozer_aliasing - different mappings of the same shared library to different virtual addresses creating a performance problem (depending on the index bits of the address above the page offset). This indicates that different mappings definitely can hit on the same cache entry, or not if the indexing is different. So invlpg must not invalidate L1i entries. (And the two cores sharing an L1i can share entries, not needing them to be tagged as belonging to one core or the other the way TLB entries are.)
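To make the aliasing condition concrete, here is a small C sketch that counts how many set-index bits of a virtually-indexed cache come from above the page offset. The function name `vipt_alias_bits` and the helper are hypothetical, and it assumes power-of-two sizes. A result of 0 means the index is taken entirely from page-offset bits, so the cache can be treated as physically indexed; a non-zero result is the case where the hardware needs the tricks discussed in the links above.

```c
#include <assert.h>

/* floor(log2(x)) for x >= 1; sizes here are assumed to be powers of two. */
static unsigned floor_log2(unsigned long x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical helper: number of index bits that lie above the page
 * offset. One way of the cache spans cache_bytes / ways bytes, which is
 * covered by the set-index bits plus the line-offset bits together. */
unsigned vipt_alias_bits(unsigned long cache_bytes, unsigned ways,
                         unsigned long page_bytes)
{
    assert(ways > 0 && page_bytes > 0 && cache_bytes >= ways);

    unsigned way_bits = floor_log2(cache_bytes / ways);
    unsigned page_bits = floor_log2(page_bytes);

    return way_bits > page_bits ? way_bits - page_bits : 0;
}
```

With 4 KiB pages, a 32 KiB 8-way cache (used here only as an example configuration) gives 0, while the 64 KiB 4-way and 64 KiB 2-way L1i geometries mentioned above give 2 and 3 bits respectively.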

The uop cache (DSB) in Sandybridge-family is virtually addressed, so it does need to be invalidated by invlpg. (I assume it's tagged with the PCID (process context ID), so it doesn't necessarily have to be invalidated on every change to CR3. It needs at least some context tagging for SMT (hyperthreading).)

I don't know as much about AMD Zen's uop cache. Virtually-addressed does presumably shorten hit latency for Intel, and strong decoders make refilling it cheap, so it would make sense if AMD made the same choice.

Does it potentially cause a cache entry to be created, or affected?

No, it shouldn't. invlpg doesn't load or store anything; it should never cause a page walk (which would go through cache), because it just accesses the TLB itself to find entries that match the given virtual address. x86 addressing modes don't include memory-indirect, so the address calculation only involves reading registers.

Even AMD's invlpgb (which broadcasts an invalidate to other CPUs, new in Zen 3) probably shouldn't touch memory.

Without invlpgb, OSes need to send IPIs (inter-processor interrupts) to shoot down TLB entries on other cores for multi-threaded processes. An interrupt handler then runs on each of those CPUs, which of course does involve loads and stores.
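From user space, the usual way to make the kernel invalidate a TLB entry is the mprotect() downgrade described in the question. The following is a minimal Linux sketch (the function name `downgrade_and_restore` is hypothetical) that touches a page, downgrades it to read-only, and checks that the data is still readable afterwards. It only demonstrates the functional behaviour; it does not measure cache state, which would need timing or performance counters.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the page contents survive a
 * permission downgrade and restore, -1 on any failure. */
int downgrade_and_restore(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    void *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    assert((uintptr_t)m % len == 0);   /* mmap returns page-aligned memory */

    volatile unsigned char *p = m;
    int ok = 1;

    p[0] = 0x5a;                       /* touch the page so it is mapped and cached */

    /* Downgrade to read-only: the kernel has to drop the stale writable
     * translation for this page. */
    if (mprotect(m, len, PROT_READ) != 0)
        ok = 0;
    if (ok && p[0] != 0x5a)            /* the data itself is unchanged */
        ok = 0;

    /* Restore write permission and check that the page is writable again. */
    if (ok && mprotect(m, len, PROT_READ | PROT_WRITE) != 0)
        ok = 0;
    if (ok) {
        p[0] = 0xa5;
        if (p[0] != 0xa5)
            ok = 0;
    }

    munmap(m, len);
    return ok ? 0 : -1;
}
```

Note that the system call itself runs kernel code that loads and stores, so as with the IPI handler, any cache disturbance comes from that code rather than from the TLB invalidation.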