I'm trying to show that a certain optimization reduces the stalls due to branch misprediction. My intuition is that the reduction comes from fewer stall cycles on loads that delay the branch outcome.
To check this, I was planning to use the Linux perf utility to read the hardware performance counters. There is a related metric called branch-load-misses; however, no useful description of it is provided.
Can anybody please confirm if this is the right metric to use? If not, please suggest a related metric that could be of help.
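For reference, this is roughly the command I had in mind (./my_app is just a placeholder for my benchmark):

$ perf stat -e branch-load-misses -- ./my_app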
Thank you
Linux perf has the branches and branch-misses counters; on Intel x86 these map to BR_INST_RETIRED.ALL_BRANCHES and BR_MISP_RETIRED.ALL_BRANCHES, which count all retired branches and all retired mispredicted branches, respectively.
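If your perf build ships the vendor event tables, you can also request these underlying events by name (shown here on /bin/true, as in the example below):

$ perf stat -e BR_INST_RETIRED.ALL_BRANCHES,BR_MISP_RETIRED.ALL_BRANCHES -- /bin/true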
perf list also includes branch-loads and branch-load-misses, but no explanation of what they do. Weirdly, the kernel sources only reference them in the context of PowerPC [1]. On x86, it seems they are simply mapped to branches and branch-misses, as they return identical values:
$ perf stat -e branches,branch-misses,branch-loads,branch-load-misses -- /bin/true
 Performance counter stats for '/bin/true':

           415,881      branches
             8,787      branch-misses             #    2.11% of all branches
           415,881      branch-loads
             8,787      branch-load-misses
Regarding your original question, keep in mind that the impact of branches comes from two components: the number of mispredicted branches, and the branch resolution time (the time to compute the actual branch outcome, which may depend on long-latency loads). The former can be measured with the branch-misses event. To quantify the latter, you may be better off with something like TopDown analysis [2].
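Recent perf versions can report the top-level TopDown breakdown directly; depending on the CPU and perf version it is exposed as a metric group or a dedicated flag (again, ./my_app is a placeholder for your benchmark):

$ perf stat -M TopdownL1 -- ./my_app
$ perf stat --topdown -a -- sleep 1

There, Bad Speculation reflects pipeline slots wasted on mispredicted paths, while Backend Bound covers stalls waiting on execution resources such as long-latency loads.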
[1] https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/generic-compat-pmu.c
[2] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis