
Formulas in perf stat


I am wondering about the formulas used in perf stat to calculate figures from the raw data.

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp

    1080267.226401      task-clock (msec)         #   19.062 CPUs utilized          
 1,592,123,216,789      cycles                    #    1.474 GHz                      (50.00%)
   871,190,006,655      instructions              #    0.55  insn per cycle           (75.00%)
     3,697,548,810      cache-references          #    3.423 M/sec                    (75.00%)
       459,457,321      cache-misses              #   12.426 % of all cache refs      (75.00%)

In this context, how do you calculate M/sec from cache-references?


Solution

  • The formulas do not seem to be implemented in builtin-stat.c (where the default event sets for perf stat are defined). They appear to be calculated (and averaged, with stddev) in perf_stat__print_shadow_stats(), with some statistics collected into arrays in perf_stat__update_shadow_stats():

    http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L626

    When HW_INSTRUCTIONS is counted: "Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS

    if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
        total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
        if (total) {
            ratio = avg / total;
            print_metric(ctxp, NULL, "%7.2f ",
                    "insn per cycle", ratio);
        } else {
            print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
        }
    

    The branch-miss ratio comes from print_branch_misses(), computed as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS.

    perf_stat__print_shadow_stats() also contains several cache-miss ratio calculations, such as HW_CACHE_MISSES / HW_CACHE_REFERENCES, plus some more detailed ones (used in perf stat -d mode).

    Stalled-cycle percentages are computed as HW_STALLED_CYCLES_FRONTEND / HW_CPU_CYCLES and HW_STALLED_CYCLES_BACKEND / HW_CPU_CYCLES.

    GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats (cycles per nanosecond is numerically the same as GHz), where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK and SW_CPU_CLOCK; the exact difference between the two has remained unclear since it was raised on LKML in 2010 and on SO in 2014):

    if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
        perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
        update_stats(&runtime_nsecs_stats[cpu], count[0]);
    

    There are also several formulas for transactions (perf stat -T mode).

    "CPUs utilized" is task-clock (or cpu-clock) divided by walltime_nsecs_stats, where the wall time is measured by perf stat itself, in userspace, as the real time elapsed around the command (read from CLOCK_MONOTONIC):

    static inline unsigned long long rdclock(void)
    {
        struct timespec ts;
    
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }
    
    ...
    
    static int __run_perf_stat(int argc, const char **argv)
    {    
    ...
        /*
         * Enable counters and exec the command:
         */
        t0 = rdclock();
        clock_gettime(CLOCK_MONOTONIC, &ref_time);
        if (forks) {
            ....
        }
        t1 = rdclock();
    
        update_stats(&walltime_nsecs_stats, t1 - t0);
    

    There are also some estimates from the Top-Down methodology (see "Tuning Applications Using a Top-down Microarchitecture Analysis Method"; "Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake", IDF2015; and #22 in Gregg's Methodology List). The perf support was described in 2016 by Andi Kleen in "Add top down metrics to perf stat", https://lwn.net/Articles/688335/ (the perf stat --topdown -I 1000 cmd mode).

    And finally, if there is no specific formula for the event being printed, there is a universal "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845 It is the event count divided by the runtime in nanoseconds (taken from the task-clock or cpu-clock event, if one was present in the perf stat event set). The factor of 1000.0 converts events per nanosecond into millions of events per second:

    } else if (runtime_nsecs_stats[cpu].n != 0) {
        char unit = 'M';
        char unit_buf[10];
    
        total = avg_stats(&runtime_nsecs_stats[cpu]);
    
        if (total)
            ratio = 1000.0 * avg / total;
        if (ratio < 0.001) {
            ratio *= 1000;
            unit = 'K';
        }
        snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
        print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
    }