AMD: performance counter for cycles on TLB miss

I'm looking for AMD-specific performance counters that count the cycles consumed by page walks when TLB misses occur. I know Intel has such metrics available.

But do such counters exist on AMD? I looked in http://developer.amd.com/wordpress/media/2013/12/56255_OSRR-1.pdf but didn't find anything close to what I need.

I also looked in the perf source code at https://elixir.bootlin.com/linux/latest/source/arch/x86/events/amd/core.c#L248 but it does not seem to have such an event either.

Maybe they have different names? Any suggestions?

Solution

It seems to me you're looking for events similar to Intel's *.WALK_DURATION or *.WALK_ACTIVE on AMD Zen processors. There are no such events with the same exact meaning, but there are similar events.

The closest events are the IBS performance data fields IbsTlbRefillLat and IbsItlbRefillLat, which measure the number of cycles it takes to fulfill an L1 DTLB or L1 ITLB miss, respectively, when the selected uop or instruction fetch misses. Note that in perf record, IbsTlbRefillLat can be captured with the ibs_op PMU and IbsItlbRefillLat can be captured with the ibs_fetch PMU.

The event Core::X86::Pmc::Core::LsTwDcFills is also useful. It counts the L1 data cache fills performed for page table walks that miss in the L1, broken down by data source (local L2, L3 on the same die, L3 on another die, DRAM or IO on the same die, DRAM or IO on another die). Walks fulfilled from farther sources are more expensive and would probably have a larger impact on performance. This event doesn't count walks that hit in the L1 data cache, although there are other events that count L2 TLB misses. Also, this event only counts walks caused by L2 DTLB misses, not ITLB misses.

In current versions of the upstream kernel, LsTwDcFills is not listed by perf list, so perf doesn't know the event by name. You'll have to specify the event code using the syntax cpu/event=0x5B,umask=0x0/. This event represents any page table walk for a data load or store for which there is an allocated MAB (meaning that the walker missed in the L1D). You can filter the count by response by specifying an appropriate umask value as defined in the manual. For example, the event cpu/event=0x5B,umask=0x48/ represents a walk where the response came from local or remote main memory.
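As a minimal sketch of the raw-event syntax above, the following Python helper formats the event string and assembles a perf stat command line without running it. The helper names (raw_event, perf_stat_cmd) and the workload ./app are hypothetical; only the event code 0x5B and the umask values 0x0 and 0x48 come from the text above.

```python
from typing import List


def raw_event(event: int, umask: int, pmu: str = "cpu") -> str:
    """Format a raw event in perf's pmu/event=..,umask=../ syntax."""
    return f"{pmu}/event={event:#x},umask={umask:#x}/"


def perf_stat_cmd(events: List[str], workload: List[str]) -> List[str]:
    """Build (but do not run) a perf stat command, one -e flag per event."""
    cmd = ["perf", "stat"]
    for ev in events:
        cmd += ["-e", ev]
    # "--" separates perf's own options from the workload command line.
    return cmd + ["--"] + workload


if __name__ == "__main__":
    # Values taken from the text: 0x5B is LsTwDcFills, umask 0x48 selects
    # responses from local or remote main memory.
    ls_tw_dc_fills_dram = raw_event(0x5B, 0x48)
    # "./app" is a placeholder workload, not something from the answer.
    print(" ".join(perf_stat_cmd([ls_tw_dc_fills_dram], ["./app"])))
```

Pass the resulting list to subprocess.run on a Zen machine if you want to execute it; the sketch stops at printing so it can be checked anywhere.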

One good way to combine these facilities is to first monitor LsTwDcFills. If it exceeds some threshold compared to the total number of memory accesses (excluding instruction fetches), then capture IbsTlbRefillLat for sampled uops to locate where in your code these expensive walks are occurring. Similarly, for instruction fetch walks, use the event Core::X86::Pmc::Core::BpL1TlbMissL2Miss for counting total walks (a walk only happens when the fetch misses in both the L1 and L2 ITLB), and if the count is too large with respect to total fetches, use IbsItlbRefillLat to locate where in your code the most expensive walks are occurring.
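The first step of this approach, for the data side, could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the counts are assumed to have been read from perf stat output beforehand, and the 1% threshold is an arbitrary placeholder, since the right value depends on your workload.

```python
def walk_fill_ratio(tw_dc_fills: int, mem_accesses: int) -> float:
    """Fraction of data memory accesses that caused an LsTwDcFills fill."""
    if mem_accesses <= 0:
        raise ValueError("mem_accesses must be positive")
    return tw_dc_fills / mem_accesses


def should_sample_with_ibs(tw_dc_fills: int, mem_accesses: int,
                           threshold: float = 0.01) -> bool:
    """Decide whether it is worth capturing IbsTlbRefillLat samples.

    threshold is an assumed, workload-dependent cutoff (1% here).
    """
    return walk_fill_ratio(tw_dc_fills, mem_accesses) > threshold


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    fills, accesses = 5_000, 100_000
    if should_sample_with_ibs(fills, accesses):
        print("Walk fills are significant: sample IbsTlbRefillLat next.")
    else:
        print("Walk fills look negligible: no IBS sampling needed.")
```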