I have a simple Hello World program written in C, which I compiled statically into two binaries using: gcc -static -fno-pie -o hello{1|2} hello.c.
I expected that executing these two binaries would exhibit cache effects and result in fewer cache misses when running them consecutively. However, my observations using perf suggest otherwise.
When running perf on hello1 and hello2, I have the following results:
Running hello1:
perf stat -d -e cache-references,cache-misses,cycles,instructions,minor-faults,major-faults taskset -c 31 ./hello1
63568 cache-references
15597 cache-misses # 24.54% of all cache refs
2106286 cycles
2359692 instructions # 1.12 insn per cycle
87 minor-faults
9 major-faults
623005 L1-dcache-loads
43737 L1-dcache-load-misses # 7.02% of all L1-dcache accesses
5604 LLC-loads
891 LLC-load-misses # 15.90% of all LL-cache accesses
0.009663884 seconds time elapsed
Running hello2:
perf stat -d -e cache-references,cache-misses,cycles,instructions,minor-faults,major-faults taskset -c 31 ./hello2
68070 cache-references
15779 cache-misses # 23.18% of all cache refs
2121759 cycles
2346213 instructions # 1.11 insn per cycle
87 minor-faults
9 major-faults
620314 L1-dcache-loads
43012 L1-dcache-load-misses # 6.93% of all L1-dcache accesses
5656 LLC-loads
825 LLC-load-misses # 14.59% of all LL-cache accesses
0.009523645 seconds time elapsed
Despite running these two identical programs, the cache miss rate remains almost unchanged. The L1, L2, and LLC cache behavior is very similar for both executions. I expected hello2 to benefit from cached memory pages of hello1, but this does not seem to be the case. I repeated this experiment on the same core and different CPU cores, but the results are still nearly identical.
My Questions are:
My OS is Debian 12 (Bookworm) with kernel 6.1.0-25-amd64.
Linux doesn't deduplicate memory by default, so the two executables don't share any physical pages (or cache lines), whether their contents are the same or different. Neither do other mainstream OSes.
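For reference, the opt-in deduplication Linux does have is KSM (Kernel Samepage Merging), and a minimal sketch for checking its state is below. It assumes the kernel was built with CONFIG_KSM and exposes /sys/kernel/mm/ksm/run; the ksm_state helper and its optional path argument are hypothetical names introduced here. Even when KSM is running it only merges anonymous pages in regions that asked for it, so it would not merge the file-backed code of two static binaries.

```shell
#!/bin/sh
# Hypothetical helper: print the KSM state read from a sysfs-style file.
# The optional first argument overrides the path (an assumption made here
# so the function can be exercised on machines without KSM).
ksm_state() {
    run_file=${1:-/sys/kernel/mm/ksm/run}
    if [ ! -r "$run_file" ]; then
        # No such file: kernel built without KSM, or sysfs not visible.
        echo "unavailable"
        return
    fi
    case "$(cat "$run_file")" in
        0) echo "stopped" ;;               # ksmd not running, merged pages kept
        1) echo "running" ;;               # ksmd scanning opted-in regions
        2) echo "stopped-and-unmerged" ;;  # ksmd stopped, merged pages split again
        *) echo "unknown" ;;
    esac
}

echo "KSM: $(ksm_state)"
```

On a stock desktop or server install this typically reports stopped or unavailable, which matches the point above: nothing is merging identical pages behind your back.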
As far as the kernel is concerned, everything is the same as if you'd compiled two different programs into separate static executables. You'd need to at least cp --reflink=always hello1 hello2, or more reliably ln hello1 hello2, to make both binaries share the same blocks on disk and thus the same pages in the pagecache. Even then, that will only help for cache-references to .text and .rodata, not .bss or the stack, or .data after it's written (the mapping is MAP_PRIVATE, so the first write to a page gives the process its own private copy).
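A quick way to see the difference between a plain copy and a hard link is to ask whether two paths name the same inode, since the pagecache is keyed by inode. The sketch below is only an illustration: same_file is a hypothetical helper, the file names are made up, and a small text file stands in for the real binary. It does not cover the reflink case, where the inodes differ even though the disk blocks are shared.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if both paths refer to the same device and inode.
same_file() {
    [ "$1" -ef "$2" ]
}

workdir=$(mktemp -d)
printf 'stand-in for the hello binary\n' > "$workdir/hello1"

cp "$workdir/hello1" "$workdir/hello_copy"   # separate inode, separate pagecache pages
ln "$workdir/hello1" "$workdir/hello_link"   # second name for the same inode

if same_file "$workdir/hello1" "$workdir/hello_copy"; then copy_shared=yes; else copy_shared=no; fi
if same_file "$workdir/hello1" "$workdir/hello_link"; then link_shared=yes; else link_shared=no; fi

echo "copy shares inode with hello1: $copy_shared"
echo "link shares inode with hello1: $link_shared"

rm -rf "$workdir"
```

Two separately compiled outputs behave like the copy here: identical bytes, but distinct inodes, so the kernel caches them independently.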
Also, unless you ran both perf commands as part of the same command line, and probably with taskset -c 1 or something, they'd likely run on separate physical cores which only share L3 (LLC), and/or a core would go into a deep sleep and flush + power off its L1 / L2 caches while you were editing the command line for the second command. For example:
for i in {1,2};
do taskset -c 1 perf stat ... ./hello$i
done