Run the following shell script on an Apple Silicon Mac (I tried it on a MacBook Air 2020 with M1 and on a MacBook Pro 16" with M3 Max, both running macOS Sonoma with Apple clang 15.0):
#!/bin/bash
rm -rf NTRU
git clone https://github.com/vector-polymul-ntru-ntrup/NTRU.git
cd NTRU/ntruhps2048677/aarch64_tmvp
cat << 'EOF' > speed_polymul.c # quoted delimiter: the shell leaves the C code untouched
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "api.h"
#include "params.h"
#include "poly.h"
#include "tmvp.h"
#include "batch_multiplication.h"
#define NTESTS 1000
uint64_t time0, time1;
uint64_t cycles[NTESTS];
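/* On Apple Silicon, read the PMU cycle counter directly (root only)
   and report the average over NTESTS runs; on other targets, time
   each iteration and report the median. */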
#ifdef __APPLE__
#include "m1cycles.h"
#define __AVERAGE__
#define SETUP_COUNTER() {(void)cycles; setup_rdtsc();}
#define CYCLE_TYPE "%lld"
#define GET_TIME rdtsc()
#else
#include "hal.h"
#undef __AVERAGE__
#define __MEDIAN__
#define SETUP_COUNTER() {}
#define CYCLE_TYPE "%ld"
#define GET_TIME hal_get_time()
static int cmp_uint64(const void *a, const void *b){
    const uint64_t x = *(const uint64_t*)a, y = *(const uint64_t*)b;
    /* compare explicitly: subtracting uint64_t values and truncating
       the result to int can return the wrong sign */
    return (x > y) - (x < y);
}
#endif
#ifdef __AVERAGE__
#define LOOP_INIT(__clock0, __clock1) { \
__clock0 = GET_TIME; \
}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
__clock1 = GET_TIME; \
printf(__f_string, (__clock1 - __clock0) / NTESTS); \
}
#define BODY_INIT(__clock0, __clock1) {}
#define BODY_TAIL(records, __clock0, __clock1) {}
#elif defined(__MEDIAN__)
#define LOOP_INIT(__clock0, __clock1) {}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
qsort(records, NTESTS, sizeof(uint64_t), cmp_uint64); /* base, nmemb, size, cmp */ \
printf(__f_string, records[NTESTS >> 1]); \
}
#define BODY_INIT(__clock0, __clock1) { \
__clock0 = GET_TIME; \
}
#define BODY_TAIL(records, __clock0, __clock1) { \
__clock1 = GET_TIME; \
records[i] = __clock1 - __clock0; \
}
#endif
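/* Run func NTESTS times and print one aggregate cycle count
   (average or median, depending on the mode selected above). */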
#define WRAP_FUNC(__f_string, records, __clock0, __clock1, func){ \
LOOP_INIT(__clock0, __clock1); \
for(size_t i = 0; i < NTESTS; i++){ \
BODY_INIT(__clock0, __clock1); \
func; \
BODY_TAIL(records, __clock0, __clock1); \
} \
LOOP_TAIL(__f_string, records, __clock0, __clock1); \
}
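/* The operands and the result are globals, so their addresses come
   from the data segment and do not depend on main()'s stack frame. */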
poly src1, src2;
poly des;
int main(int argc, char *argv[])
{
int iters = 1;
SETUP_COUNTER();
if (argc > 1) {
iters = atoi(argv[1]);
}
if (iters == 1) {
WRAP_FUNC("polymul (Z_q[x] / (x^677 - 1)): " CYCLE_TYPE " \n",
cycles, time0, time1,
poly_Rq_mul(&(des), &(src1), &(src2)));
}
return 0;
}
EOF
make -j16 speed_polymul
cp speed_polymul abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
echo Type your password for sudo now
sudo echo Password OK
sudo ./speed_polymul > /dev/null # First run is slower (possibly has to do with verifying code signature)
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ > /dev/null # First run is slower (possibly has to do with verifying code signature)
echo Running speed_polymul 5 times
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
echo Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
(I apologize for the need for sudo, but the cycle counters in Apple Silicon CPUs can only be accessed as root. I understand if anyone is paranoid and would rather not run this code.)
What this script does is:

1. Clone the https://github.com/vector-polymul-ntru-ntrup/NTRU repository and enter the directory of the aarch64_tmvp implementation for the ntruhps2048677 parameter set;
2. Overwrite `speed_polymul.c` with the minimal benchmark shown above (note that all variables are moved out of `main()` to the global scope);
3. Build the benchmark (`speed_polymul`);
4. Make a copy of `speed_polymul` with a different name (`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`) -- this is an identical byte-per-byte copy made with the `cp` command-line utility;
5. After one discarded warm-up run of each executable, run `speed_polymul` 5 times, then the renamed copy 5 times.

The relevant part of the output (step 5) is always similar to this:
Running speed_polymul 5 times
polymul (Z_q[x] / (x^677 - 1)): 6881
polymul (Z_q[x] / (x^677 - 1)): 6916
polymul (Z_q[x] / (x^677 - 1)): 6864
polymul (Z_q[x] / (x^677 - 1)): 6883
polymul (Z_q[x] / (x^677 - 1)): 6881
Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
polymul (Z_q[x] / (x^677 - 1)): 7179
polymul (Z_q[x] / (x^677 - 1)): 7186
polymul (Z_q[x] / (x^677 - 1)): 7177
polymul (Z_q[x] / (x^677 - 1)): 7195
polymul (Z_q[x] / (x^677 - 1)): 7179
There is a clear difference of ~5% in performance, which is fully reproducible for me. Note this is CPU-only code, which doesn't perform any kind of I/O.
It would be natural to think this is related to running the benchmark only once (actually 1000 times -- see `NTESTS` in `speed_polymul.c`), but I also have a version of the code which runs it in a loop (each iteration an average of 1000 runs, as just stated), and the issue equally happens if I run it for 100,000 iterations. The version above is the simplest one I could create that reliably reproduces the problem for me.
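For reference, the looped variant I mention is essentially the following change to `main()` (a sketch, assuming it simply repeats the `WRAP_FUNC` block; the real code differs in details):

if (iters > 1) {
    /* e.g. ./speed_polymul 100000 prints 100000 averaged figures */
    for (int it = 0; it < iters; it++) {
        WRAP_FUNC("polymul (Z_q[x] / (x^677 - 1)): " CYCLE_TYPE " \n",
                  cycles, time0, time1,
                  poly_Rq_mul(&(des), &(src1), &(src2)));
    }
}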
I have a project which generates many executables (which obviously have distinct file names) to benchmark identical code with different inputs, for a variable number of iterations passed as a command-line parameter. So I am being bitten by this issue: different file names running exactly the same code, only with different inputs, produce different cycle counts (and the inputs themselves do not matter, as this is constant-time crypto code).
Also, this isn't a case of "maybe the crypto code isn't constant-time after all and execution time varies with the inputs": I can make a bit-for-bit identical copy of the executable under a different name (exactly as the script does) and still get these unquestionably different cycle counts for each name.
My hypothesis is that the difference is related to the size of the buffer pointed to by `argv[0]` in `main()`, and that this somehow triggers an alignment issue in the variables used by the code. However, this still needs more investigation to pinpoint the exact chain of events.
To that end, I can confirm that `argv[0]` indeed changes. I added a `printf()` statement to print the addresses of all the variables in the code, along these lines:
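printf("&time0 = %p &time1 = %p cycles = %p argv = %p argv[0] = %p "
       "&src1 = %p &src2 = %p &des = %p\n",
       (void *)&time0, (void *)&time1, (void *)cycles,
       (void *)argv, (void *)argv[0],
       (void *)&src1, (void *)&src2, (void *)&des);

These are the outputs when running the code under two different names that produce different results: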
&time0 = 0x10071b428 &time1 = 0x10071b430 cycles = 0x100718408 argv = 0x16f703b18 argv[0] = 0x16f703c10 &src1 = 0x10071a8e8 &src2 = 0x10071ae88 &des = 0x10071a348
polymul (Z_q[x] / (x^677 - 1)): 7242
&time0 = 0x102b13428 &time1 = 0x102b13430 cycles = 0x102b10408 argv = 0x16d30bb08 argv[0] = 0x16d30bc08 &src1 = 0x102b128e8 &src2 = 0x102b12e88 &des = 0x102b12348
polymul (Z_q[x] / (x^677 - 1)): 6878
Looking at the addresses modulo 256 (i.e. the last two nibbles of each address), the only difference is in `argv` and `argv[0]`. However, it is unclear why this would affect either the function being benchmarked (which only depends on `src1`, `src2` and `des`) or the code that measures clock cycles (which depends on `time0`, `time1` and `cycles`, but not on `argv` or `argv[0]`).
Your hypothesis (an alignment effect triggered by the buffer layout) sounds plausible. But I think your expectations are too high: a systematic difference of 5% in runtime can arise for any number of reasons outside your control.
It's a bit old, but have a look at the Stabilizer slides (or this review of their paper); they specifically mention argc/argv as affecting the stack layout. And since you benchmark in order to make your code faster, you will end up comparing different compilations. You should expect that any unrelated change (in your code, the OS, the compiler, the CPU, or a library) can reverse a 5% difference that you measured earlier.
From the review cited above:
The SPECint 2000 “perl” benchmark (which is just the Perl interpreter running some – completely deterministic! – code) is infamous for having +-5% swings (this happens for multiple architectures) whenever anything in the executable changes.
Update: a more recent article, Performance Roulette, has a deep dive into how function alignment can cause this.
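If you want to probe the alignment angle in your own build, a minimal experiment (only a sketch: `__attribute__((aligned(N)))` on functions is a GCC/Clang extension, and whether 64 bytes is the boundary that matters is exactly the assumption you would be testing) is to pin the benchmarked call to a fixed boundary and see whether the name-dependent difference survives:

/* Pin the wrapper to a fixed instruction boundary; rebuilding the
   whole project with -falign-functions=64 has a similar global
   effect. Note this only pins the wrapper, not poly_Rq_mul itself,
   so a full test means recompiling the implementation too. */
__attribute__((aligned(64)))
static void benchmarked_kernel(void)
{
    poly_Rq_mul(&des, &src1, &src2);  /* same call as in WRAP_FUNC */
}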
Your second question, how to get stable results: I think Stabilizer got that right; you have to randomize your benchmark and average over multiple runs. Try to trigger the effect by some other means, e.g. by calling `alloca()` or `malloc()` in `main()` to randomize the stack/heap position, and then average it out, so that the executable name no longer matters.
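A minimal sketch of that idea (the 4096-byte range, the use of `rand()`, and the `run_benchmark()` wrapper are illustrative choices, not part of any established recipe):

#include <alloca.h>   /* non-standard, but available on macOS and Linux */
#include <stdlib.h>
#include <time.h>

extern void run_benchmark(void);  /* hypothetical: your WRAP_FUNC loop */

int main(void)
{
    srand((unsigned)time(NULL));
    /* Shift the stack pointer by a random amount so the frames below
       main() no longer inherit an alignment dictated by the length of
       argv[0]. The volatile write keeps the compiler from dropping
       the allocation as unused. */
    volatile char *pad = alloca(1 + rand() % 4096);
    pad[0] = 0;
    run_benchmark();
    return 0;
}

Across enough runs the random offsets average out, so the comparison stops depending on the executable name; you can also re-randomize per measurement by calling the timed function through a helper that performs the `alloca()`, so each call gets a fresh offset.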
There is also Coz, the causal profiler, by the same authors as Stabilizer. I haven't used it myself, but it seemed relevant. (Update: on second glance, Coz is about multi-threading; it does nothing about stack layout, so it wouldn't have helped here.)
Alternatively, you could take an engineering stance: whenever you end up with a 5% difference, declare it within tolerance and pick the more readable version.