I'm trying to show that a certain optimization reduces the stalls due to branch misprediction. My intuition is that the reduction comes from fewer stall cycles on loads that delay the branch outcome.
To check this, I was planning to use the Linux perf utility to read the hardware performance counters. There is a related metric called branch-load-misses; however, no useful description of it is provided.
Can anybody please confirm if this is the right metric to use? If not, please suggest a related metric that could be of help.
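For reference, this is roughly the command I had in mind (./my_app is just a placeholder for my benchmark):

$ perf stat -e branch-load-misses -- ./my_app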
Thank you
Linux perf has the branches and branch-misses counters; on Intel x86 these map to BR_INST_RETIRED.ALL_BRANCHES and BR_MISP_RETIRED.ALL_BRANCHES, which count all retired branches and all retired mispredicted branches, respectively.
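If your perf build ships the vendor event tables, you can also request these underlying events by name (shown here on /bin/true, as in the example below):

$ perf stat -e BR_INST_RETIRED.ALL_BRANCHES,BR_MISP_RETIRED.ALL_BRANCHES -- /bin/true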
perf list also includes branch-loads and branch-load-misses, but no explanation of what they do. Weirdly, the kernel sources only reference them in the context of PowerPC [1]. On x86, it seems they are simply mapped to branches and branch-misses, as they return identical values:
$ perf stat -e branches,branch-misses,branch-loads,branch-load-misses -- /bin/true
 Performance counter stats for '/bin/true':

           415,881      branches
             8,787      branch-misses             #    2.11% of all branches
           415,881      branch-loads
             8,787      branch-load-misses
Regarding your original question, keep in mind that the impact of branches comes from two components: the number of mispredicted branches, and the branch resolution time (the time to compute the actual branch outcome, which may depend on long-latency loads). The former can be measured with the branch-misses event. To quantify the latter, you may be better off with something like TopDown analysis [2].
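Recent perf versions can report the top-level TopDown breakdown directly; depending on the CPU and perf version it is exposed as a metric group or a dedicated flag (again, ./my_app is a placeholder for your benchmark):

$ perf stat -M TopdownL1 -- ./my_app
$ perf stat --topdown -a -- sleep 1

There, Bad Speculation reflects pipeline slots wasted on mispredicted paths, while Backend Bound covers stalls waiting on execution resources such as long-latency loads.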
[1] https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/generic-compat-pmu.c
[2] https://perf.wiki.kernel.org/index.php/Top-Down_Analysis