Cache Effects in Statically Compiled Binaries: Unexpected Cache Misses

I have a simple Hello World program written in C, which I statically compiled twice, producing hello1 and hello2, using: gcc -static -fno-pie -o hello{1|2} hello.c.

I expected that executing these two binaries would exhibit cache effects and result in fewer cache misses when running them consecutively. However, my observations using perf suggest otherwise.

When running perf on hello1 and hello2, I have the following results:

Running hello1:

perf stat -d -e cache-references,cache-misses,cycles,instructions,minor-faults,major-faults taskset -c 31 ./hello1

             63568      cache-references
             15597      cache-misses                     #   24.54% of all cache refs
           2106286      cycles
           2359692      instructions                     #    1.12  insn per cycle
                87      minor-faults
                 9      major-faults
            623005      L1-dcache-loads
             43737      L1-dcache-load-misses            #    7.02% of all L1-dcache accesses
              5604      LLC-loads
               891      LLC-load-misses                  #   15.90% of all LL-cache accesses

       0.009663884 seconds time elapsed

Running hello2:

perf stat -d -e cache-references,cache-misses,cycles,instructions,minor-faults,major-faults taskset -c 31 ./hello2

             68070      cache-references
             15779      cache-misses                     #   23.18% of all cache refs
           2121759      cycles
           2346213      instructions                     #    1.11  insn per cycle
                87      minor-faults
                 9      major-faults
            620314      L1-dcache-loads
             43012      L1-dcache-load-misses            #    6.93% of all L1-dcache accesses
              5656      LLC-loads
               825      LLC-load-misses                  #   14.59% of all LL-cache accesses

       0.009523645 seconds time elapsed

Despite running these two identical programs, the cache miss rate remains almost unchanged. The L1, L2, and LLC cache behavior is very similar for both executions. I expected hello2 to benefit from cached memory pages of hello1, but this does not seem to be the case. I repeated this experiment on the same core and different CPU cores, but the results are still nearly identical.

My questions are:

Is the page cache per process or shared across processes? Is it related to the fact that binaries are statically-linked?
Why are cache misses not significantly reduced? Shouldn’t the second execution benefit from cached instructions and data? Is it due to software/hardware mitigations such as Spectre/Meltdown?

My OS is Debian 12 (Bookworm) with kernel 6.1.0-25-amd64.

Solution

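Linux doesn't do memory deduplication by default, so two executable files don't share any physical pages (or cache lines), whether their contents are the same or different. The page cache is shared across processes, but it's indexed by file (inode and offset), not by content, so identical bytes in two separate files are cached separately. Nor do other mainstream OSes deduplicate by default.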

As far as the kernel is concerned, everything is the same as if you'd compiled two different programs into separate static executables. You'd need to at least cp --reflink=always hello1 hello2 or, more reliably, ln hello1 hello2 to make both binaries share the same blocks on disk and thus the same pages in the pagecache. Even then, that will only help for cache-references to .text and .rodata, not .bss or the stack, or .data after it's written (it's mapped MAP_PRIVATE, so the first write to a page gives the process its own private copy).
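To see the difference between a copy and a hard link, compare inode numbers. This is a minimal sketch, assuming GNU coreutils (stat -c) as on Debian; the scratch directory and the hello_copy name are made up for illustration, and a text file stands in for the real binary.

```shell
# Hypothetical scratch directory and file contents, for illustration only.
workdir=$(mktemp -d)
printf 'pretend this is a static binary\n' > "$workdir/hello1"

# A plain copy gets its own inode, so the kernel treats it as a separate file.
cp "$workdir/hello1" "$workdir/hello_copy"

# A hard link is a second name for the same inode.
ln "$workdir/hello1" "$workdir/hello2"

inode_orig=$(stat -c '%i' "$workdir/hello1")
inode_copy=$(stat -c '%i' "$workdir/hello_copy")
inode_link=$(stat -c '%i' "$workdir/hello2")

echo "hello1=$inode_orig copy=$inode_copy hardlink=$inode_link"

# Remove the scratch directory again.
rm -rf "$workdir"
```

The hard link should report the same inode number as hello1, while the copy reports a different one. Matching inodes are what let two names map the same pagecache pages.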

Also, unless you ran both perf commands as part of the same command line, and probably pinned with taskset -c 1 or something, they'd likely run on separate physical cores, which only share L3 (LLC). And/or a core could go into a deep sleep, flushing and powering off its L1 / L2 caches, while you were editing the command line for the second command. Running both from one loop avoids that gap:

for i in 1 2; do
    taskset -c 1 perf stat ... ./hello$i
done
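
To check which cores actually share a given cache level before picking a core for taskset, Linux exposes the cache topology in sysfs. The following is a rough sketch, assuming the usual /sys/devices/system/cpu/cpuN/cache layout; the cache_sharing function name is made up for this example, and CPU 1 is used only because it matches the taskset -c 1 above.

```shell
# Print which CPUs share each cache level with the given CPU, read from sysfs.
# cache_sharing is a hypothetical helper name, not a standard tool.
cache_sharing() {
    cpu=$1
    base="/sys/devices/system/cpu/cpu${cpu}/cache"
    if [ ! -d "$base" ]; then
        # Some VMs and containers do not expose this, or the CPU does not exist.
        echo "no cache topology exposed for cpu${cpu}"
        return 0
    fi
    for idx in "$base"/index*; do
        [ -r "$idx/level" ] || continue
        printf 'L%s %-12s shared with CPUs %s\n' \
            "$(cat "$idx/level")" "$(cat "$idx/type")" "$(cat "$idx/shared_cpu_list")"
    done
}

cache_sharing 1
```

On a typical x86 desktop or server, the L1 and L2 lines list only the hyperthread siblings of that core, while the L3 line lists many or all cores, which is why two runs on different cores can share warm lines only in the LLC.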