I wrote a benchmark to test some particular functionality. Running the benchmark typically gave consistent results, but roughly one run in ten appeared to be about 3x faster in every benchmark test case.
I wondered if there was some kind of branch prediction or cache locality issue affecting this, so I ran it under perf, like so:
sudo perf stat -B -e cache-references,cache-misses,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses ./my_benchmark
Now the results are reversed: roughly nine out of ten times it runs faster, in which case the perf stat output looks like so:
Performance counter stats for './my_benchmark':
336,011 cache-references # 75.756 M/sec (41.40%)
74,722 cache-misses # 22.238 % of all cache refs
4.435442 task-clock (msec) # 0.964 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.129 M/sec
13,745,945 cycles # 3.099 GHz
16,521,518 instructions # 1.20 insn per cycle
4,453,340 branches # 1004.035 M/sec
91,336 branch-misses # 2.05% of all branches (58.60%)
0.004603313 seconds time elapsed
And in roughly one out of ten trials it runs 3x slower, showing results like so:
Performance counter stats for './my_benchmark':
348,441 cache-references # 22.569 M/sec (74.14%)
112,153 cache-misses # 32.187 % of all cache refs (74.14%)
15.439061 task-clock (msec) # 0.965 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
572 page-faults # 0.037 M/sec
13,717,144 cycles # 0.888 GHz (62.52%)
16,951,632 instructions # 1.24 insn per cycle (88.40%)
4,463,213 branches # 289.086 M/sec
70,185 branch-misses # 1.57% of all branches (89.20%)
0.015999175 seconds time elapsed
I notice that the task always seems to complete in roughly the same number of cycles, but that the reported frequencies differ: the "fast" case shows about 3.1 GHz, whereas the slow case shows about 0.9 GHz. I don't know exactly what this stat means, though, so I can't tell whether it is just a tautological consequence of a similar cycle count and a longer runtime, or whether the processor's clock is actually running at a different speed.
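To make the "tautological" possibility concrete, here is a small sketch that recomputes the GHz column from the two outputs above. It assumes (as far as I understand perf stat, but this is an assumption rather than something the output itself states) that the figure is simply cycles divided by task-clock time. The function name effective_ghz is hypothetical, introduced only for illustration.

```python
def effective_ghz(cycles: int, task_clock_msec: float) -> float:
    """Average cycles per nanosecond of on-CPU time, i.e. GHz.

    Assumption: this mirrors how the GHz column in perf stat is derived.
    """
    nanoseconds = task_clock_msec * 1e6  # 1 msec = 1e6 ns
    return cycles / nanoseconds


# Numbers copied from the two perf stat outputs above.
fast = effective_ghz(13_745_945, 4.435442)
slow = effective_ghz(13_717_144, 15.439061)

print(f"fast run: {fast:.3f} GHz")
print(f"slow run: {slow:.3f} GHz")
print(f"fast/slow ratio: {fast / slow:.2f}")
```

Both values reproduce the figures perf printed (about 3.099 and 0.888), so the column is consistent with being a derived ratio. Note that the cycles event counts core clock ticks, so a similar cycle count spread over a longer task-clock is also what a genuinely lower clock speed would look like.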
I do notice that in both cases it says "context switches: 0" and "cpu migrations: 0," so it doesn't look like the slowdown is coming from the benchmark being preempted.
What is going on here, and can I write (or launch?) my program in such a way that I always get the faster performance?
CPU frequency often varies with load. I would lock the frequency before running this.
What OS are you on?
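Before locking the frequency, it may help to see what the scaling governor is currently doing. The following is a minimal sketch that assumes a Linux system exposing the cpufreq sysfs interface under /sys/devices/system/cpu; whether that applies here depends on the answer to the OS question. The helper name read_cpufreq and its root parameter are hypothetical, and missing files are reported as None rather than treated as an error.

```python
from pathlib import Path

# Standard cpufreq sysfs attributes on Linux (may be absent, e.g. in VMs).
ATTRS = ("scaling_governor", "scaling_cur_freq",
         "scaling_min_freq", "scaling_max_freq")


def read_cpufreq(cpu: int = 0, root: str = "/sys/devices/system/cpu") -> dict:
    """Return the cpufreq settings for one CPU; None for unreadable entries."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in ATTRS:
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            # Assumption: treat a missing or unreadable file as "unknown".
            info[name] = None
    return info


if __name__ == "__main__":
    for key, value in read_cpufreq().items():
        print(f"{key}: {value}")
```

If the governor reads something like "powersave" or "ondemand", a benchmark lasting only a few milliseconds may finish before the clock ramps up, which would be consistent with the slow runs above. The frequency values in these files are in kHz.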