Tags: linux-kernel, benchmarking, scheduler, perf

Getting accurate time measurement with `perf-stat`


I am trying to benchmark simple "hello, world" programs written in several languages. I am a n00b in this field, and the perf-stat manual does not make this easy because of its lack of examples.

To do that I run perf-stat at high priority to avoid context switching, so I came up with something like this:

sudo chrt -f 99 perf stat -e cs -e cpu-clock ./hello_c

However, the results I get for the same program differ significantly. For instance, the results for the same compiled C executable can be:

     0      cs                        #    0.000 K/sec                  
  0.42 msec cpu-clock                 #    0.612 CPUs utilized          

0.000694107 seconds time elapsed
0.000713000 seconds user
0.000000000 seconds sys

or

     0      cs                        #    0.000 K/sec                  
  0.58 msec cpu-clock                 #    0.620 CPUs utilized          

0.000936635 seconds time elapsed
0.000000000 seconds user
0.000940000 seconds sys

In this particular example there is a 0.242528 msec discrepancy (0.000936635 - 0.000694107 seconds), even though context switches equal 0 in both runs.

Is there something I am missing, some calculation I need to do? Or is it not possible to get closer results? Are there any other options to fix this problem than taking the average of n executions?


Solution

  • There are a variety of reasons you can see variation when you repeatedly benchmark what appears to be the same code. I have covered some of the reasons in another answer and it would be worthwhile to keep those in mind.

    However, based on experience and playing the probabilities, we can eliminate many of those up front. What's left are the most likely causes of your relatively large deviations for short programs from a cold start:

    1. CPU power saving and frequency scaling features.
    2. Actual runtime behavior differences, i.e., different code executed in the runtime library, VM, OS or other supporting infrastructure each time you run your program.
    3. Some caching effect, or code or data alignment effect that varies from run to run.

    You can probably separate these three effects with a plain perf stat without overriding the event list, like:

    $ perf stat true
    
     Performance counter stats for 'true':
    
              0.258367      task-clock (msec)         #    0.427 CPUs utilized          
                     0      context-switches          #    0.000 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                    41      page-faults               #    0.159 M/sec                  
               664,570      cycles                    #    2.572 GHz                    
               486,817      instructions              #    0.73  insn per cycle         
                92,503      branches                  #  358.029 M/sec                  
                 3,978      branch-misses             #    4.30% of all branches        
    
           0.000605076 seconds time elapsed
    

    Look first at the 2.572 GHz line. This shows the effective CPU frequency, calculated by dividing the true number of CPU cycles by the task-clock value (CPU time spent by the program). If this varies from run to run, the wall-clock time deviation is partly or completely explained by that change, and the most likely cause is (1) above, i.e., CPU frequency scaling, including both scaling below nominal frequency (power saving) and above it (turbo boost or similar features).
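
    For example, plugging in the numbers from the sample output above (just a sanity check on that output, not new data):

        664,570 cycles / 0.258367 ms = 664,570 / 0.000258367 s ≈ 2.572e9 cycles/s ≈ 2.572 GHz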

    The details of disabling frequency scaling depend on the hardware, but a common approach that works on most modern Linux distributions is cpupower -c all frequency-set -g performance to inhibit below-nominal scaling.
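
    A minimal sketch of checking and pinning the governor, assuming the cpupower tool is installed (exact sysfs paths can vary with the cpufreq driver in use):

        # show the current driver, governor and frequency limits
        cpupower frequency-info

        # pin every core to the performance governor (inhibits below-nominal scaling)
        sudo cpupower -c all frequency-set -g performance

        # or inspect the governor directly through sysfs
        cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor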

    Disabling turbo boost is more complicated and may depend on the hardware platform and even the specific CPU, but for recent x86 some options include (a small sketch of the first option follows the list):

    • Writing 1 to /sys/devices/system/cpu/intel_pstate/no_turbo (Intel only)
    • Doing a wrmsr -p${core} 0x1a0 0x4000850089 for each ${core} in your system (although one on each socket is probably enough on some/most/all chips?). (Intel only)
    • Adjusting the /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq value to set a maximum frequency.
    • Using the userspace governor and /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed to set a fixed frequency.
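
    A small sketch of the first option, assuming an Intel CPU using the intel_pstate driver (run as root; writing 1 disables turbo, 0 re-enables it):

        # disable turbo boost
        echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

        # confirm the setting took effect
        cat /sys/devices/system/cpu/intel_pstate/no_turbo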

    Another option is to simply run your test repeatedly, and hope that the CPU quickly reaches a steady state. perf stat has built-in support for that with the --repeat=N option:

       -r, --repeat=<n>
           repeat command and print average + stddev (max: 100). 0 means forever.
    
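    For example, to average 100 runs of the executable from the question (./hello_c is the name used there):

        sudo chrt -f 99 perf stat -r 100 -e cs -e cpu-clock ./hello_c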

    Let's say you observe that the frequency is always the same (within 1% or so), or you have fixed the frequency issues but some variance remains.

    Next, check the instructions line. This is a rough indicator of how much total work your program is doing. If it varies in the same direction and with a similar relative magnitude as your runtime variance, you have a problem of type (2): some runs are doing more work than others. Without knowing what your program is, it would be hard to say more, but you can use tools like strace or perf record + perf annotate to track that down.
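
    As a rough sketch of tracking that down (./hello_c is the executable name from the question; the actual output depends on your setup):

        # sample the program while it runs
        perf record ./hello_c

        # break the samples down by function/symbol
        perf report

        # drill into the annotated disassembly of the hot spots
        perf annotate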

    If instructions doesn't vary, and frequency is fixed, but runtime varies, you have a problem of type (3) or "other". You'll want to look at more performance counters to see which correlate with the slower runs: are you having more cache misses? More context switches? More branch mispredictions? The list goes on. Once you find out what is slowing you down, you can try to isolate the code that is causing it. You can also go the other direction: using traditional profiling to determine what part of the code slows down on the slow runs.
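
    A sketch of such a wider counter run, using a few common event names (availability and exact names vary by CPU and kernel; perf list shows what your machine supports):

        perf stat -e cycles,instructions,cache-references,cache-misses,branches,branch-misses,context-switches ./hello_c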

    Good luck!