I am trying to benchmark simple "hello, world" programs written in several languages. I am a n00b in this field, and the perf-stat manual does not make this easy, due to a lack of examples. To avoid context switching I run perf-stat at high priority, so I came up with something like this:
sudo chrt -f 99 perf stat -e cs -e cpu-clock ./hello_c
However, the results I get for the same program differ significantly. For instance, the results for the same compiled C executable can be:
0 cs # 0.000 K/sec
0.42 msec cpu-clock # 0.612 CPUs utilized
0.000694107 seconds time elapsed
0.000713000 seconds user
0.000000000 seconds sys
or
0 cs # 0.000 K/sec
0.58 msec cpu-clock # 0.620 CPUs utilized
0.000936635 seconds time elapsed
0.000000000 seconds user
0.000940000 seconds sys
In this particular example there is a discrepancy of 0.242528 msec, even though context switches equal 0 in both runs.
Is there something I am missing, some calculation I need to do? Or is it not possible to get closer results? Are there any other options to fix this problem than taking the average of n executions?
There are a variety of reasons you can see variation when you repeatedly benchmark what appears to be the same code. I have covered some of the reasons in another answer and it would be worthwhile to keep those in mind.
However, based on experience and playing the probabilities, we can eliminate many of those up front. What's left are the most likely causes of your relatively large deviations for short programs run from a cold start:
(1) CPU frequency scaling, including both power-saving scaling below the nominal frequency and turbo boost (or similar) above it.
(2) The program actually doing a different amount of work from run to run.
(3) Everything else: other processes, cache and TLB effects, interrupts, and similar system-level noise.
You can probably separate these three effects with a plain perf stat run, without overriding the event list, like:
$ perf stat true
Performance counter stats for 'true':
0.258367 task-clock (msec) # 0.427 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
41 page-faults # 0.159 M/sec
664,570 cycles # 2.572 GHz
486,817 instructions # 0.73 insn per cycle
92,503 branches # 358.029 M/sec
3,978 branch-misses # 4.30% of all branches
0.000605076 seconds time elapsed
Look first at the 2.572 GHz line. This shows the effective CPU frequency, calculated by dividing the true number of CPU cycles by the task-clock value (CPU time spent by the program). If this varies from run to run, the wall-clock time deviation is partly or completely explained by this change, and the most likely cause is (1) above, i.e., CPU frequency scaling, including both scaling below nominal frequency (power saving) and above it (turbo boost or similar features).
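As a sanity check, you can do that division yourself with the numbers from the sample output above (your values will differ):

664,570 cycles / 0.258367 msec = 664,570 cycles / 0.000258367 s ≈ 2.572e9 cycles/s ≈ 2.572 GHz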
The details of disabling frequency scaling depend on the hardware, but a common approach that works on most modern Linux distributions is cpupower -c all frequency-set -g performance to inhibit below-nominal scaling.
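As a minimal sketch (assuming the cpupower utility from your distribution's linux-tools package is installed and you have root), you can set the governor and then check what policy is actually active:

$ sudo cpupower -c all frequency-set -g performance
$ cpupower frequency-info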
Disabling turbo boost is more complicated and may depend on the hardware platform and even the specific CPU, but for recent x86 some options include:
- Writing 1 to /sys/devices/system/cpu/intel_pstate/no_turbo (Intel only).
- Running wrmsr -p${core} 0x1a0 0x4000850089 for each ${core} in your system (although one on each socket is probably enough on some/most/all chips?). (Intel only)
- Writing a value to /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq to set a maximum frequency.
- Using the userspace governor and /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed to set a fixed frequency.
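As an illustration of the first option, a minimal sketch assuming the intel_pstate driver is in use and you have root (writing 1 disables turbo, writing 0 re-enables it):

$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
$ cat /sys/devices/system/cpu/intel_pstate/no_turbo   # should now print 1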
Another option is to simply run your test repeatedly, and hope that the CPU quickly reaches a steady state. perf stat has built-in support for that with the --repeat=N option:
-r, --repeat=<n>
repeat command and print average + stddev (max: 100). 0 means forever.
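For example, combining --repeat with your original command (the count of 20 is arbitrary), perf stat will print the average over the 20 runs plus the relative standard deviation for each counter:

$ sudo chrt -f 99 perf stat -r 20 -e cs -e cpu-clock ./hello_c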
Let's say you observe that the frequency is always the same (within 1% or so), or you have fixed the frequency issues but some variance remains.
Next, check the instructions line. This is a rough indicator of how much total work your program is doing. If it varies in the same direction and with a similar relative magnitude as your runtime, you have a problem of type (2): some runs are doing more work than others. Without knowing what your program is, it would be hard to say more, but you can use tools like strace and perf record + perf annotate to track that down.
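A minimal sketch of that workflow (hello_c is your binary from the question; for a program this short you may need to raise the sampling frequency to collect enough samples):

$ strace -c ./hello_c            # summary of system calls made per run
$ perf record -F 10000 ./hello_c
$ perf report                    # see where the samples landed
$ perf annotate                  # annotated source/assembly for the hot spots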
If instructions doesn't vary, and frequency is fixed, but runtime varies, you have a problem of type (3) or "other". You'll want to look at more performance counters to see which correlate with the slower runs: are you having more cache misses? More context switches? More branch mispredictions? The list goes on. Once you find out what is slowing you down, you can try to isolate the code that is causing it. You can also go the other direction: using traditional profiling to determine what part of the code slows down on the slow runs.
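For example, you can request several suspect counters in a single run (these are the generic event names; availability varies by CPU), or use perf stat -d for a built-in "detailed" set:

$ perf stat -e task-clock,context-switches,page-faults,cache-misses,branch-misses ./hello_c
$ perf stat -d ./hello_c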
Good luck!