When using Dhrystone to measure DMIPS, I found that LTO greatly affects the results. The LTO build of dhrystone scores nearly 4x the build without LTO:
$ wget http://www.xanthos.se/~joachim/dhrystone-src.tar.gz
$ cd dhrystone-src
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone # use qemu-user to execute
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.2
Dhrystones per Second: 5234421.7
VAX MIPS rating = 2979.181
Performance counter stats for './dhrystone':
19,158.53 msec task-clock:u # 0.969 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
547 page-faults:u # 28.551 /sec
81,470,643,102 cycles:u # 4.252 GHz (50.01%)
3,046,747 stalled-cycles-frontend:u # 0.00% frontend cycles idle (50.02%)
37,208,106,969 stalled-cycles-backend:u # 45.67% backend cycles idle (50.00%)
319,848,969,156 instructions:u # 3.93 insn per cycle
# 0.12 stalled cycles per insn (49.99%)
49,311,879,609 branches:u # 2.574 G/sec (49.98%)
317,518 branch-misses:u # 0.00% of all branches (50.00%)
19.762244278 seconds time elapsed
19.118127000 seconds user
0.004017000 seconds sys
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone -flto
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 19539623.0
VAX MIPS rating = 11121.015
Performance counter stats for './dhrystone':
5,146.69 msec task-clock:u # 0.908 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
553 page-faults:u # 107.448 /sec
21,453,263,692 cycles:u # 4.168 GHz (50.00%)
1,574,543 stalled-cycles-frontend:u # 0.01% frontend cycles idle (50.03%)
12,575,396,819 stalled-cycles-backend:u # 58.62% backend cycles idle (50.04%)
89,186,371,586 instructions:u # 4.16 insn per cycle
# 0.14 stalled cycles per insn (50.00%)
7,717,732,872 branches:u # 1.500 G/sec (49.97%)
353,303 branch-misses:u # 0.00% of all branches (49.96%)
5.666446006 seconds time elapsed
5.133037000 seconds user
0.003322000 seconds sys
As you can see, LTO dhrystone scores 19,539,623.0 Dhrystones per second and LTO-less dhrystone scores 5,234,421.7. LTO dhrystone executes 89,186,371,586 instructions and LTO-less dhrystone executes 319,848,969,156.
I think the root cause is that the LTO build executes far fewer instructions (about 3.6x fewer), so it runs much faster.
But when I run benchmarks like CoreMark or CoreMark-PRO, LTO gives no notable improvement over non-LTO.
LTO allows cross-file inlining, so if you have tiny helper functions (like C++ get/set functions in classes) that aren't visible in a .h for normal inlining, LTO can greatly simplify code that calls such functions a lot.
A simple get or set wrapper can inline to zero instructions (with the object data just living in registers), but a call/ret needs to pass an arg in a register, not to mention executing the actual bl and ret instructions. The call also has to respect the calling convention, so the call site might need to mov some values to call-preserved registers. When inlining, the compiler has full control over all the registers.
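As a minimal sketch of that effect (the struct and function names are hypothetical, not taken from Dhrystone), the following uses the GCC/Clang `noinline` attribute to stand in for helpers defined in another .c file and built without LTO, next to `static inline` versions that stand in for what the compiler can do once it sees the bodies. Both functions return the same value; only the generated code should differ.

```c
#include <stdint.h>

/* Hypothetical counter object, used only for illustration. */
struct counter { uint64_t value; };

/* Stand-ins for helpers that live in another .c file without LTO:
   noinline (a GCC/Clang attribute) keeps a real call at each call site. */
__attribute__((noinline)) uint64_t get_far(const struct counter *c) { return c->value; }
__attribute__((noinline)) void set_far(struct counter *c, uint64_t v) { c->value = v; }

/* Stand-ins for the same helpers once their bodies are visible to the caller. */
static inline uint64_t get_near(const struct counter *c) { return c->value; }
static inline void set_near(struct counter *c, uint64_t v) { c->value = v; }

/* Every iteration makes two real calls (bl/ret on AArch64). */
uint64_t sum_with_calls(uint64_t n)
{
    struct counter c = { 0 };
    for (uint64_t i = 0; i < n; i++)
        set_far(&c, get_far(&c) + i);
    return get_far(&c);
}

/* The helpers can inline away, so c.value is free to live in a register. */
uint64_t sum_inlined(uint64_t n)
{
    struct counter c = { 0 };
    for (uint64_t i = 0; i < n; i++)
        set_near(&c, get_near(&c) + i);
    return get_near(&c);
}
```

Compiling this with optimization enabled and comparing the two functions in the disassembly (for example with `objdump -d`) should show calls only in `sum_with_calls`; the exact output depends on the compiler version and flags.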
For benchmarks, putting the work in a separate file from the repeat loop is a good way to stop compilers from defeating the benchmark by optimizing across repeat-loop iterations (e.g. hoisting work out of the loop instead of recomputing it every time).
Unless you use LTO, which lets the compiler see across files again and so can break such benchmarks. (Or maybe there's another reason with dhrystone, IDK.)
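A minimal sketch of how callee visibility changes what a repeat loop is allowed to do, using a toy kernel and hypothetical names rather than Dhrystone's actual code. The call through a `volatile` function pointer only imitates a callee in a separate .c file built without LTO; it is an assumption made to keep the example in one file, not how Dhrystone is structured.

```c
#include <stdint.h>

/* Toy kernel (hypothetical, not Dhrystone code): the result depends only on x. */
static uint32_t work(uint32_t x)
{
    uint32_t h = x;
    for (int i = 0; i < 16; i++)
        h = h * 2654435761u + 12345u;   /* unsigned wraparound is well defined */
    return h;
}

/* Calling through a volatile function pointer imitates a callee in another
   file without LTO: the compiler cannot tell which body runs, so it has to
   make the call on every iteration. */
static uint32_t (*volatile work_opaque)(uint32_t) = work;

/* Visible callee: the optimizer may inline work() and hoist the
   loop-invariant result, so the loop may not do reps units of work. */
uint32_t bench_visible(uint32_t x, uint32_t reps)
{
    uint32_t acc = 0;
    for (uint32_t r = 0; r < reps; r++)
        acc += work(x);
    return acc;
}

/* Opaque callee: work() really runs reps times. */
uint32_t bench_opaque(uint32_t x, uint32_t reps)
{
    uint32_t acc = 0;
    for (uint32_t r = 0; r < reps; r++)
        acc += work_opaque(x);
    return acc;
}
```

Both functions return the same value, so a benchmark that only checks results would not notice if `bench_visible` stopped doing the repeated work; comparing instruction counts with `perf stat`, as in the question, is one way to spot it.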