Why does my benchmark randomly appear to run 3x as fast, and why does running it inside perf make this happen much more often?

I wrote a benchmark to test some particular functionality. Running the benchmark typically gave consistent results, but roughly one out of ten times it appeared to run about 3x faster in every benchmarking test case.

I wondered if there was some kind of branch prediction or cache locality issue affecting this, so I ran it in perf, like so:

sudo perf stat -B -e cache-references,cache-misses,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses ./my_benchmark

Now the results are reversed: roughly nine out of ten times it runs faster, in which case the perf stat output looks like so:

 Performance counter stats for './my_benchmark':

           336,011      cache-references          #   75.756 M/sec                    (41.40%)
            74,722      cache-misses              #   22.238 % of all cache refs    
          4.435442      task-clock (msec)         #    0.964 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               572      page-faults               #    0.129 M/sec                  
        13,745,945      cycles                    #    3.099 GHz                    
        16,521,518      instructions              #    1.20  insn per cycle         
         4,453,340      branches                  # 1004.035 M/sec                  
            91,336      branch-misses             #    2.05% of all branches          (58.60%)

       0.004603313 seconds time elapsed

And in roughly one out of ten trials it runs 3x slower, showing results like so:

 Performance counter stats for './my_benchmark':

           348,441      cache-references          #   22.569 M/sec                    (74.14%)
           112,153      cache-misses              #   32.187 % of all cache refs      (74.14%)
         15.439061      task-clock (msec)         #    0.965 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               572      page-faults               #    0.037 M/sec                  
        13,717,144      cycles                    #    0.888 GHz                      (62.52%)
        16,951,632      instructions              #    1.24  insn per cycle           (88.40%)
         4,463,213      branches                  #  289.086 M/sec                  
            70,185      branch-misses             #    1.57% of all branches          (89.20%)

       0.015999175 seconds time elapsed

I notice that the task always seems to complete in roughly the same number of cycles, but that the frequencies are different -- in the "fast" case it shows something like 3 GHz, whereas in the slow case it shows something like 900 MHz. I don't know exactly what this stat means, though, so I don't know whether it is just a tautological consequence of the similar cycle count and longer runtime, or whether it means the processor's clock is actually running at a different speed.

I do notice that in both cases it says "context switches: 0" and "cpu migrations: 0," so it doesn't look like the slowdown is coming from the benchmark being preempted.

What is going on here, and can I write (or launch?) my program in such a way that I always get the faster performance?

Solution

CPU frequency often varies with load, so the clock may not have ramped up yet when a short benchmark starts. I would lock the frequency before running this.
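
The numbers in the question fit this. As far as I know, the GHz figure perf stat prints next to cycles is simply the cycle count divided by task-clock, so it is an average rate over the run rather than a direct reading of the clock. A quick sketch of that arithmetic, using the values copied from the two runs above (the function name `effective_ghz` is just for illustration):

```python
def effective_ghz(cycles: int, task_clock_msec: float) -> float:
    """Average clock rate implied by a cycle count and the CPU time it took."""
    seconds = task_clock_msec / 1000.0
    return cycles / seconds / 1e9


# (cycles, task-clock in msec) copied from the two perf stat runs in the question
RUNS = {
    "fast": (13_745_945, 4.435442),
    "slow": (13_717_144, 15.439061),
}

if __name__ == "__main__":
    for label, (cycles, msec) in RUNS.items():
        print(f"{label}: {effective_ghz(cycles, msec):.3f} GHz")
```

This reproduces the 3.099 GHz and 0.888 GHz shown in the perf output. Since both runs need about 13.7 million cycles and only the time differs, the work itself is the same and the core was just clocked lower in the slow run. That points at frequency scaling rather than cache or branch behaviour.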

What OS are you on?
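
If it is Linux (the use of perf suggests so), you can check which governor is active and what the core is currently clocked at before each run. A minimal sketch, assuming the kernel exposes the usual cpufreq files under `/sys/devices/system/cpu`; they can be missing, for example in some virtual machines, in which case the helper returns `None`. The helper name `read_cpufreq` and its `root` parameter are made up for this example:

```python
from pathlib import Path

# Standard location of the per-CPU cpufreq attributes on Linux (assumed present).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_cpufreq(cpu=0, root=CPUFREQ_ROOT):
    """Return (governor, current_khz) for one CPU; None for anything unreadable."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    governor = read("scaling_governor")
    khz_text = read("scaling_cur_freq")
    khz = int(khz_text) if khz_text is not None and khz_text.isdigit() else None
    return (governor, khz)


if __name__ == "__main__":
    governor, khz = read_cpufreq(0)
    print(f"cpu0 governor={governor} current_khz={khz}")
```

If the governor turns out to be something like `powersave` or `ondemand`, switching it to `performance` for the duration of the benchmark (for example with `cpupower frequency-set -g performance` as root, if that tool is installed) is one common way to lock the frequency. Adding a warm-up phase before the timed section is another option.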