Run the following shell script on an Apple Silicon Mac (I tried it on a MacBook Air 2020 with M1 and on a MacBook Pro 16" with M3 Max, both running macOS Sonoma with Apple clang 15.0):
#!/bin/bash
rm -rf NTRU
git clone https://github.com/vector-polymul-ntru-ntrup/NTRU.git
cd NTRU/ntruhps2048677/aarch64_tmvp
cat << 'EOF' > speed_polymul.c # quoted delimiter: the shell leaves the C code untouched
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "api.h"
#include "params.h"
#include "poly.h"
#include "tmvp.h"
#include "batch_multiplication.h"
#define NTESTS 1000
uint64_t time0, time1;
uint64_t cycles[NTESTS];
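/* On Apple Silicon, read the PMU cycle counter directly (root only)
   and report the average over NTESTS runs; on other targets, time
   each iteration and report the median. */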
#ifdef __APPLE__
#include "m1cycles.h"
#define __AVERAGE__
#define SETUP_COUNTER() {(void)cycles; setup_rdtsc();}
#define CYCLE_TYPE "%lld"
#define GET_TIME rdtsc()
#else
#include "hal.h"
#undef __AVERAGE__
#define __MEDIAN__
#define SETUP_COUNTER() {}
#define CYCLE_TYPE "%ld"
#define GET_TIME hal_get_time()
static int cmp_uint64(const void *a, const void *b){
    const uint64_t x = *(const uint64_t*)a, y = *(const uint64_t*)b;
    /* compare explicitly: subtracting uint64_t values and truncating
       the result to int can return the wrong sign */
    return (x > y) - (x < y);
}
#endif
#ifdef __AVERAGE__
#define LOOP_INIT(__clock0, __clock1) { \
__clock0 = GET_TIME; \
}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
__clock1 = GET_TIME; \
printf(__f_string, (__clock1 - __clock0) / NTESTS); \
}
#define BODY_INIT(__clock0, __clock1) {}
#define BODY_TAIL(records, __clock0, __clock1) {}
#elif defined(__MEDIAN__)
#define LOOP_INIT(__clock0, __clock1) {}
#define LOOP_TAIL(__f_string, records, __clock0, __clock1) { \
qsort(records, NTESTS, sizeof(uint64_t), cmp_uint64); /* base, nmemb, size, cmp */ \
printf(__f_string, records[NTESTS >> 1]); \
}
#define BODY_INIT(__clock0, __clock1) { \
__clock0 = GET_TIME; \
}
#define BODY_TAIL(records, __clock0, __clock1) { \
__clock1 = GET_TIME; \
records[i] = __clock1 - __clock0; \
}
#endif
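/* Run func NTESTS times and print one aggregate cycle count
   (average or median, depending on the mode selected above). */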
#define WRAP_FUNC(__f_string, records, __clock0, __clock1, func){ \
LOOP_INIT(__clock0, __clock1); \
for(size_t i = 0; i < NTESTS; i++){ \
BODY_INIT(__clock0, __clock1); \
func; \
BODY_TAIL(records, __clock0, __clock1); \
} \
LOOP_TAIL(__f_string, records, __clock0, __clock1); \
}
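/* The operands and the result are globals, so their addresses come
   from the data segment and do not depend on main()'s stack frame. */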
poly src1, src2;
poly des;
int main(int argc, char *argv[])
{
int iters = 1;
SETUP_COUNTER();
if (argc > 1) {
iters = atoi(argv[1]);
}
if (iters == 1) {
WRAP_FUNC("polymul (Z_q[x] / (x^677 - 1)): " CYCLE_TYPE " \n",
cycles, time0, time1,
poly_Rq_mul(&(des), &(src1), &(src2)));
}
return 0;
}
EOF
make -j16 speed_polymul
cp speed_polymul abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
echo Type your password for sudo now
sudo echo Password OK
sudo ./speed_polymul > /dev/null # First run is slower (possibly has to do with verifying code signature)
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ > /dev/null # First run is slower (possibly has to do with verifying code signature)
echo Running speed_polymul 5 times
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
sudo ./speed_polymul
echo Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
sudo ./abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
(I apologize for the need for sudo, but the cycle counters in Apple Silicon CPUs can only be accessed as root. I understand if anyone is paranoid and would rather not run this code.)
What this script does is:

1. Clone the https://github.com/vector-polymul-ntru-ntrup/NTRU repository and enter the directory of the aarch64_tmvp implementation for the ntruhps2048677 parameter set;
2. Overwrite `speed_polymul.c` with the minimal benchmark shown above (note that all variables are moved out of `main()` to the global scope);
3. Build the benchmark (`speed_polymul`);
4. Make a copy of `speed_polymul` with a different name (`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`) -- this is an identical byte-per-byte copy made with the `cp` command-line utility;
5. After one discarded warm-up run of each executable, run `speed_polymul` 5 times, then the renamed copy 5 times.

The relevant part of the output (step 5) is always similar to this:
Running speed_polymul 5 times
polymul (Z_q[x] / (x^677 - 1)): 6881
polymul (Z_q[x] / (x^677 - 1)): 6916
polymul (Z_q[x] / (x^677 - 1)): 6864
polymul (Z_q[x] / (x^677 - 1)): 6883
polymul (Z_q[x] / (x^677 - 1)): 6881
Running abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 5 times
polymul (Z_q[x] / (x^677 - 1)): 7179
polymul (Z_q[x] / (x^677 - 1)): 7186
polymul (Z_q[x] / (x^677 - 1)): 7177
polymul (Z_q[x] / (x^677 - 1)): 7195
polymul (Z_q[x] / (x^677 - 1)): 7179
There is a clear difference of ~5% in performance, which is fully reproducible for me. Note this is CPU-only code, which doesn't perform any kind of I/O.
It would be natural to think this is related to running the benchmark only once (actually 1000 times -- see `NTESTS` in `speed_polymul.c`), but I also have a version of the code which runs it in a loop (each iteration an average of 1000 runs, as just stated), and the issue equally happens if I run it for 100,000 iterations. The version above is the simplest one I could create that reliably reproduces the problem for me.
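For reference, the looped variant I mention is essentially the following change to `main()` (a sketch, assuming it simply repeats the `WRAP_FUNC` block; the real code differs in details):

if (iters > 1) {
    /* e.g. ./speed_polymul 100000 prints 100000 averaged figures */
    for (int it = 0; it < iters; it++) {
        WRAP_FUNC("polymul (Z_q[x] / (x^677 - 1)): " CYCLE_TYPE " \n",
                  cycles, time0, time1,
                  poly_Rq_mul(&(des), &(src1), &(src2)));
    }
}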
I have a project which generates many executables (which obviously have distinct file names) to benchmark identical code with different inputs, for a variable number of iterations passed as a command-line parameter. So I am being bitten by this issue: different file names running exactly the same code, only with different inputs, produce different cycle counts (and the inputs themselves do not matter, as this is constant-time crypto code).
Also, this isn't a case of "maybe the crypto code isn't constant-time after all and execution time varies with the inputs": I can make a bit-for-bit identical copy of the executable under a different name (exactly as the script does) and still get these unquestionably different cycle counts for each name.
My hypothesis is that the difference is related to the size of the buffer pointed to by `argv[0]` in `main()`, and that this somehow triggers an alignment issue in the variables used by the code. However, this still needs more investigation to pinpoint the exact chain of events.
To that end, I can confirm that `argv[0]` indeed changes. I added a `printf()` statement to print the addresses of all the variables in the code, along these lines:
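printf("&time0 = %p &time1 = %p cycles = %p argv = %p argv[0] = %p "
       "&src1 = %p &src2 = %p &des = %p\n",
       (void *)&time0, (void *)&time1, (void *)cycles,
       (void *)argv, (void *)argv[0],
       (void *)&src1, (void *)&src2, (void *)&des);

These are the outputs when running the code under two different names that produce different results: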
&time0 = 0x10071b428 &time1 = 0x10071b430 cycles = 0x100718408 argv = 0x16f703b18 argv[0] = 0x16f703c10 &src1 = 0x10071a8e8 &src2 = 0x10071ae88 &des = 0x10071a348
polymul (Z_q[x] / (x^677 - 1)): 7242
&time0 = 0x102b13428 &time1 = 0x102b13430 cycles = 0x102b10408 argv = 0x16d30bb08 argv[0] = 0x16d30bc08 &src1 = 0x102b128e8 &src2 = 0x102b12e88 &des = 0x102b12348
polymul (Z_q[x] / (x^677 - 1)): 6878
Looking at the addresses modulo 256 (i.e. the last two nibbles of each address), the only difference is in `argv` and `argv[0]`. However, it is unclear why this would affect either the function being benchmarked (which only depends on `src1`, `src2` and `des`) or the code that measures clock cycles (which depends on `time0`, `time1` and `cycles`, but not on `argv` or `argv[0]`).
Your hypothesis (an alignment effect triggered by the buffer layout) sounds plausible. But I think your expectations are too high: a systematic difference of 5% in runtime can arise for any number of reasons outside your control.
It's a bit old, but have a look at the Stabilizer slides (or this review of their paper); they specifically mention argc/argv as affecting the stack layout. And since you benchmark in order to make your code faster, you will end up comparing different compilations. You should expect that any unrelated change (in your code, the OS, the compiler, the CPU, or a library) can reverse a 5% difference that you measured earlier.
From the review cited above:
The SPECint 2000 “perl” benchmark (which is just the Perl interpreter running some – completely deterministic! – code) is infamous for having +-5% swings (this happens for multiple architectures) whenever anything in the executable changes.
Update: a more recent article, Performance Roulette, has a deep dive into how function alignment can cause this.
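If you want to probe the alignment angle in your own build, a minimal experiment (only a sketch: `__attribute__((aligned(N)))` on functions is a GCC/Clang extension, and whether 64 bytes is the boundary that matters is exactly the assumption you would be testing) is to pin the benchmarked call to a fixed boundary and see whether the name-dependent difference survives:

/* Pin the wrapper to a fixed instruction boundary; rebuilding the
   whole project with -falign-functions=64 has a similar global
   effect. Note this only pins the wrapper, not poly_Rq_mul itself,
   so a full test means recompiling the implementation too. */
__attribute__((aligned(64)))
static void benchmarked_kernel(void)
{
    poly_Rq_mul(&des, &src1, &src2);  /* same call as in WRAP_FUNC */
}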
Your second question, how to get stable results: I think Stabilizer got that right; you have to randomize your benchmark and average over multiple runs. Try to trigger the effect by some other means, e.g. by calling `alloca()` or `malloc()` in `main()` to randomize the stack/heap position, and then average it out, so that the executable name no longer matters.
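A minimal sketch of that idea (the 4096-byte range, the use of `rand()`, and the `run_benchmark()` wrapper are illustrative choices, not part of any established recipe):

#include <alloca.h>   /* non-standard, but available on macOS and Linux */
#include <stdlib.h>
#include <time.h>

extern void run_benchmark(void);  /* hypothetical: your WRAP_FUNC loop */

int main(void)
{
    srand((unsigned)time(NULL));
    /* Shift the stack pointer by a random amount so the frames below
       main() no longer inherit an alignment dictated by the length of
       argv[0]. The volatile write keeps the compiler from dropping
       the allocation as unused. */
    volatile char *pad = alloca(1 + rand() % 4096);
    pad[0] = 0;
    run_benchmark();
    return 0;
}

Across enough runs the random offsets average out, so the comparison stops depending on the executable name; you can also re-randomize per measurement by calling the timed function through a helper that performs the `alloca()`, so each call gets a fresh offset.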
There is also Coz, the causal profiler, by the same authors as Stabilizer. I haven't used it myself, but it seemed relevant. (Update: on second glance, Coz is about multi-threading; it does nothing about stack layout, so it wouldn't have helped here.)
Alternatively, you could take an engineering stance: whenever you end up with a 5% difference, declare it within tolerance and pick the more readable version.