I'm trying to use perf stat to fetch hardware counter information for a benchmark on an Intel Xeon processor (based on Skylake). When I pass the -e LLC-loads -d -d -d flags, perf stat prints LLC-loads twice: once because of -e LLC-loads and once because the detailed flag is turned on. However, the two results are inconsistent:
$ perf stat -e LLC-loads,LLC-stores,L1-dcache-loads,L1-dcache-stores -d -d -d <my benchmark executable>
Performance counter stats for '<my benchmark executable>':
5145246847 LLC-loads (30.78%)
8167130238 LLC-stores (30.80%)
198057619358 L1-dcache-loads (30.80%)
83142567530 L1-dcache-stores (30.80%)
197792116698 L1-dcache-loads (30.79%)
27391515211 L1-dcache-load-misses # 13.84% of all L1-dcache hits (30.78%)
5114059688 LLC-loads (30.78%)
3025020254 LLC-load-misses # 58.97% of all LL-cache hits (30.76%)
<not supported> L1-icache-loads
58697135 L1-icache-load-misses (30.75%)
198322967573 dTLB-loads (30.74%)
209105723 dTLB-load-misses # 0.11% of all dTLB cache hits (30.72%)
2639992 iTLB-loads (30.74%)
1368656 iTLB-load-misses # 51.84% of all iTLB cache hits (30.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
25.301480157 seconds time elapsed
585.222699000 seconds user
1.070800000 seconds sys
As can be seen in the output, there are two LLC-loads in the output with different values. What am I getting wrong?
I've tried multiple different benchmarks assuming that it could be benchmark specific but this behavior is observed everywhere.
Note the multiplexing caused by specifying so many events: each one was only counted for about 30.78% of the total time, and the reported number is extrapolated from that. Skylake has only 4 programmable counters per logical core (with hyperthreading enabled) that can count different hardware events at once.
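As a back-of-the-envelope check on that percentage, here is a minimal sketch that assumes the events are rotated evenly across the programmable counters and ignores per-event counter constraints, which the real scheduler does have to respect. The function name `expected_running_fraction` is made up for illustration; the 13 is the number of supported events listed in the output above.

```python
from fractions import Fraction

def expected_running_fraction(n_events, n_counters):
    """Rough share of the run during which each event is actually counted,
    assuming even round-robin rotation (an assumption, not perf's exact policy)."""
    if n_events <= n_counters:
        return Fraction(1)  # everything fits at once, no multiplexing needed
    return Fraction(n_counters, n_events)

# 13 supported events in the output above, 4 programmable counters assumed.
share = expected_running_fraction(13, 4)
print(f"{float(share):.2%}")  # 30.77%
```

That lands very close to the 30.72% to 30.80% reported next to each event, which is consistent with plain time-sharing of 4 counters among 13 events.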
Your program isn't 100% uniform over time, and the sampling and extrapolation add noise, so the two numbers are close but not identical (the two LLC-loads counts differ by about 0.6% here). The multiplexing code didn't combine an event that was specified twice; instead it just put two instances of it into the rotation.
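To illustrate why two instances of the same event can disagree, here is a toy simulation. It is only a sketch: the workload numbers and the helper names `scaled_count` and `simulate` are hypothetical, and the rotation is a plain round-robin over time slices. The scaling itself (raw count times time enabled divided by time running) is how multiplexed counts get extrapolated.

```python
def scaled_count(raw, time_enabled, time_running):
    """Extrapolate a raw count from the time the counter actually ran
    to the whole time the event was enabled."""
    if time_running == 0:
        return 0.0
    return raw * time_enabled / time_running

def simulate(rates, n_groups, slot):
    """Count events only in the time slices where rotation slot `slot`
    is on a hardware counter, then extrapolate to the full run."""
    raw = sum(r for i, r in enumerate(rates) if i % n_groups == slot)
    running = sum(1 for i in range(len(rates)) if i % n_groups == slot)
    return scaled_count(raw, len(rates), running)

# Hypothetical non-uniform workload: events per time slice ramp up over the run.
rates = [100 + i for i in range(12)]
true_total = sum(rates)           # 1266

# The same event placed in two different slots of a 3-way rotation.
est_a = simulate(rates, 3, 0)     # 1254.0
est_b = simulate(rates, 3, 2)     # 1278.0
print(true_total, est_a, est_b)
```

Both estimates are close to the true total, but neither matches it or the other one, because each instance saw a different subset of the run.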
If you just counted two instances of the event without many other events, you'd expect exactly equal counts, since they'd both be active at the same time on different HW counters. (Unless the first counter picked up some events after being programmed, while the kernel was still programming the second. --all-user would avoid that by telling the HW to count only while the logical core is in user-space.) For example:
$ perf stat -e LLC-loads,LLC-loads cmp /dev/zero /dev/full
^Ccmp: Interrupt
Performance counter stats for 'cmp /dev/zero /dev/full':
31,425 LLC-loads
31,425 LLC-loads
2.748813842 seconds time elapsed
1.113722000 seconds user
1.633880000 seconds sys
(Small number of counts; I guess cmp uses buffers small enough to fit in L3 cache. I used two different files that would both read as all-zeros so it couldn't just detect they were identical.)
Related:
instructions:D and cycles:D will tell perf to always count those; there are dedicated non-programmable counters for those events on Intel CPUs, but the multiplexing code doesn't know that. You could do this with other events, but that would take away slots from events where you didn't specify :D.