Tags: linux, performance, perf, intel-vtune

Perf: Could not find a useful description of the "branch-load-misses" metric


I'm trying to show that a certain optimization reduces the stalls caused by branch misprediction. My intuition is that the improvement comes from a reduction in the stall cycles of the loads that delay the branch outcome.

For this, I was planning to use the Linux perf utility to read the hardware performance counters. There is a related metric called branch-load-misses, but no useful description of it is provided.

Can anybody please confirm if this is the right metric to use? If not, please suggest a related metric that could be of help.

Thank you


Solution

  • Linux perf has the branches and branch-misses counters. On Intel x86, these map to BR_INST_RETIRED.ALL_BRANCHES and BR_MISP_RETIRED.ALL_BRANCHES, which count all retired branches and all retired mispredicted branches, respectively.

    perf list also includes branch-loads and branch-load-misses, but gives no explanation of what they count. Weirdly, the kernel sources only reference them in the context of PowerPC [1]. On x86, it seems they are simply mapped to branches and branch-misses, since they return identical values:

    $ perf stat -e branches,branch-misses,branch-loads,branch-load-misses -- /bin/true
    
     Performance counter stats for '/bin/true':
    
               415,881      branches
                 8,787      branch-misses             #    2.11% of all branches
               415,881      branch-loads
                 8,787      branch-load-misses
    

    Regarding your original question, keep in mind that the performance impact of branches comes from two components: the number of mispredicted branches, and the branch resolution time (the time needed to compute the actual branch outcome, which may depend on long-latency loads). The former can be measured with the branch-misses event. To quantify the latter, you may be better off with something like TopDown analysis [2].

    [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/generic-compat-pmu.c

    [2] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis
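
The percentage that perf prints next to branch-misses (2.11% in the run above) is just branch-misses divided by branches. To compare that rate before and after an optimization, it can be computed from perf's CSV output. The following is a minimal sketch, assuming the `perf stat -x,` layout with the counter value in the first field and the event name in the third; `mispredict_rate` is a hypothetical helper name, not part of perf.

```shell
#!/bin/sh
# Hypothetical helper: reads `perf stat -x,` CSV lines on stdin and prints
# the branch misprediction rate in percent (branch-misses / branches).
# Assumed CSV layout: value,unit,event,...
mispredict_rate() {
    awk -F, '
        # Drop a modifier suffix such as ":u" so "branches:u" still matches.
        { name = $3; sub(/:.*/, "", name) }
        # "+ 0" forces a number, so "<not counted>" becomes 0.
        name == "branches"      { branches = $1 + 0 }
        name == "branch-misses" { misses = $1 + 0 }
        END {
            if (branches > 0)
                printf "%.2f\n", 100 * misses / branches
            else
                print "n/a"
        }
    '
}

# Example use (perf stat writes its counters to stderr, hence 2>&1 >/dev/null):
#   perf stat -x, -e branches,branch-misses -- /bin/true 2>&1 >/dev/null | mispredict_rate
```

This only captures the first component discussed in the answer (how often branches are mispredicted); it says nothing about how long each mispredicted branch takes to resolve.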