Does perf stat -C X <command> run the command on core X?


I want to profile a command, say ls, on a single core. I can use the -C flag of perf stat to specify which core to profile, but does ls actually run on the core I choose here?

When I run perf stat -C 7 ls, I get wildly different cycle counts, ranging from 150k to 5 million.

I can force ls to run on a specific core with taskset, e.g. perf stat -C 7 -A taskset --cpu-list 7 ls, but I still get wildly different cycle counts on each run, although there does seem to be less variation (from 2 to 6 million cycles). Of course, taskset would add some overhead here. Is this the right way to get the most accurate results possible?


Solution

  • No, it doesn't. It counts events on that CPU whether any threads from your task happen to be running on it or not!

    To avoid profiling taskset itself, you can use taskset -c 7 perf stat ... instead, if you don't mind perf also running on that CPU core. perf stat has very little overhead, if any, so it's not a problem that it shares a core with the workload while it's counting.
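
    The pinning order described above can be sketched as a small wrapper. This is only an illustration: pinned_stat and the DRY_RUN switch are hypothetical names, not part of perf or taskset. The wrapper pins perf itself, so the workload it forks inherits the affinity and taskset's own startup is not counted.

    ```shell
    # Hypothetical wrapper: pin perf (and the workload it forks) to one core.
    # Set DRY_RUN=1 to print the command line instead of running it.
    pinned_stat() {
        cpu="$1"; shift
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s\n' "taskset -c $cpu perf stat $*"
        else
            taskset -c "$cpu" perf stat "$@"
        fi
    }

    # Show what would be run for the question's example: ls on core 7.
    DRY_RUN=1 pinned_stat 7 ls
    ```

    Without DRY_RUN, the same call runs the workload for real, which needs perf installed and permission to read the counters.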

    perf stat -C doesn't imply -a according to the man page, so it's surprising you don't get zero counts more often, given that the process being profiled may not be running on the selected CPU core at all.

    /bin/ls is a very short-lived workload that spends most of its time in system calls, so it's a weird choice of something to profile. 4 million cycles is only 1 millisecond on a 4 GHz CPU. Much of that is probably spent in kernel code for getdents, so you'd expect high variability anyway unless you count user-space only with --all-user or -e instructions:u,cycles:u and so on.

    A normal run of a simple workload that uses some CPU time looks like this on an i7-6700k with Linux 5.16:
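
    The cycles-to-time arithmetic above can be checked with a small helper. A minimal sketch: cycles_to_ms is a hypothetical name, and it assumes the core runs at one fixed frequency, which turbo and frequency scaling don't guarantee in practice.

    ```shell
    # Hypothetical helper: convert a cycle count to milliseconds,
    # assuming a constant clock frequency given in GHz.
    # ms = cycles / (GHz * 1e9) * 1e3 = cycles / (GHz * 1e6)
    cycles_to_ms() {
        awk -v cycles="$1" -v ghz="$2" 'BEGIN { printf "%.3f\n", cycles / (ghz * 1e6) }'
    }

    # 4 million cycles at 4 GHz
    cycles_to_ms 4000000 4
    ```

    Any workload this short is on the same scale as a single timer interrupt or page fault, which is one reason the counts are so noisy.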

    $ taskset -c 4 perf stat --all-user awk 'BEGIN{for(i=0;i<10000000;i++){}}'
    
     Performance counter stats for 'awk BEGIN{for(i=0;i<10000000;i++){}}':
    
                331.11 msec task-clock                #    0.999 CPUs utilized          
                     0      context-switches          #    0.000 /sec                   
                     0      cpu-migrations            #    0.000 /sec                   
                   177      page-faults               #  534.559 /sec                   
         1,371,512,156      cycles                    #    4.142 GHz                    
         3,582,591,466      instructions              #    2.61  insn per cycle         
           970,439,895      branches                  #    2.931 G/sec                  
                22,558      branch-misses             #    0.00% of all branches        
    
           0.331526126 seconds time elapsed
    
           0.328034000 seconds user
           0.003313000 seconds sys
    

    But 10 back-to-back runs, counting user-space only on a CPU other than the one the workload is pinned to, give wildly varying numbers of instructions and cycles. (Note the variance of well over 100%.) I'm not sure exactly which instructions it could be counting; as I said, I expected this to be zero.

    $ taskset -c 4 perf stat --all-user -r10 -C 3 awk 'BEGIN{for(i=0;i<10000000;i++){}}'
    
     Performance counter stats for 'CPU(s) 3' (10 runs):
    
                329.45 msec cpu-clock                 #    0.999 CPUs utilized            ( +-  0.06% )
                     0      context-switches          #    0.000 /sec                   
                     0      cpu-migrations            #    0.000 /sec                   
                     0      page-faults               #    0.000 /sec                   
             2,692,718      cycles                    #    0.008 GHz                      ( +-124.70% )
             1,875,435      instructions              #    0.24  insn per cycle           ( +-241.60% )
               358,646      branches                  #    1.088 M/sec                    ( +-254.22% )
                12,917      branch-misses             #    0.68% of all branches          ( +- 70.77% )
    
              0.329648 +- 0.000198 seconds time elapsed  ( +-  0.06% )
    

    Instructions varied from 139k to 3767k over a few runs, and the IPC wasn't consistent either: sometimes around 1.0, but more often 0.25 +- 0.05.
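
    To compare IPC between individual runs, you can divide each run's instruction count by its cycle count by hand. A minimal sketch, where ipc is a hypothetical helper that accepts counts with or without the thousands separators perf prints; the example feeds it the numbers from the pinned awk run above.

    ```shell
    # Hypothetical helper: instructions per cycle from two raw counts.
    # Commas are stripped so values can be pasted straight from perf output.
    ipc() {
        insns=$(printf '%s' "$1" | tr -d ',')
        cycles=$(printf '%s' "$2" | tr -d ',')
        awk -v i="$insns" -v c="$cycles" 'BEGIN { printf "%.2f\n", i / c }'
    }

    # instructions and cycles from the first (pinned, same-core) run
    ipc 3,582,591,466 1,371,512,156
    ```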