linker compiler-construction benchmarking compiler-optimization lto

What kinds of programs benefit most from LTO?

When using Dhrystone to measure DMIPS, I found that LTO has a large effect on the result. The LTO build scores nearly 4x the non-LTO build:

$ wget http://www.xanthos.se/~joachim/dhrystone-src.tar.gz
$ cd dhrystone-src

Without LTO

$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static  dhry21a.c dhry21b.c timers.c -o dhrystone # use qemu-user to execute
$ perf stat ./dhrystone # input 100000000
...
Register option selected?  YES
Microseconds for one run through Dhrystone:     0.2 
Dhrystones per Second:                       5234421.7 
VAX MIPS rating =   2979.181 


 Performance counter stats for './dhrystone':

         19,158.53 msec task-clock:u              #    0.969 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               547      page-faults:u             #   28.551 /sec                   
    81,470,643,102      cycles:u                  #    4.252 GHz                      (50.01%)
         3,046,747      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (50.02%)
    37,208,106,969      stalled-cycles-backend:u  #   45.67% backend cycles idle      (50.00%)
   319,848,969,156      instructions:u            #    3.93  insn per cycle         
                                                  #    0.12  stalled cycles per insn  (49.99%)
    49,311,879,609      branches:u                #    2.574 G/sec                    (49.98%)
           317,518      branch-misses:u           #    0.00% of all branches          (50.00%)

      19.762244278 seconds time elapsed

      19.118127000 seconds user
       0.004017000 seconds sys

With LTO

$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static  dhry21a.c dhry21b.c timers.c -o dhrystone -flto
$ perf stat ./dhrystone # input 100000000
...
Register option selected?  YES
Microseconds for one run through Dhrystone:     0.1 
Dhrystones per Second:                      19539623.0 
VAX MIPS rating =  11121.015 


 Performance counter stats for './dhrystone':

          5,146.69 msec task-clock:u              #    0.908 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               553      page-faults:u             #  107.448 /sec                   
    21,453,263,692      cycles:u                  #    4.168 GHz                      (50.00%)
         1,574,543      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (50.03%)
    12,575,396,819      stalled-cycles-backend:u  #   58.62% backend cycles idle      (50.04%)
    89,186,371,586      instructions:u            #    4.16  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (50.00%)
     7,717,732,872      branches:u                #    1.500 G/sec                    (49.97%)
           353,303      branch-misses:u           #    0.00% of all branches          (49.96%)

       5.666446006 seconds time elapsed

       5.133037000 seconds user
       0.003322000 seconds sys

As you can see:

The LTO build reaches 19,539,623.0 Dhrystones per second; the non-LTO build reaches 5,234,421.7 (about 3.7x slower).
The LTO build executes 89,186,371,586 instructions; the non-LTO build executes 319,848,969,156 (about 3.6x more).

I think the root cause is that the LTO build executes far fewer instructions, so it runs much faster.

But when I run benchmarks such as CoreMark or CoreMark-Pro, LTO gives no notable improvement over the non-LTO build.

Question

What kinds of programs benefit most from LTO? Why does LTO have a big impact on Dhrystone but not on CoreMark/CoreMark-Pro?
How does LTO reduce the number of instructions executed at run time?

Solution

LTO allows cross-file inlining, so if you have tiny helper functions (like C++ get/set functions in classes) that aren't visible in a .h for normal inlining, LTO can greatly simplify code that calls such functions a lot.

A simple get or set wrapper can inline to zero instructions (with the object data just living in registers), but a real call needs to pass an arg in a register, not to mention executing the actual bl and ret instructions. The call also has to respect the calling convention, so the call site might need to mov some values to call-preserved registers. When inlining, the compiler has full control over all the registers.
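As a rough illustration of the kind of code this applies to, here is a minimal sketch. The names (struct counter, counter_get, counter_set, sum_to) are made up for the example and are not from Dhrystone. It is shown as one listing so it compiles as is; to see the effect, split it at the marker comments into two .c files and compare the generated asm with and without -flto.

```c
#include <assert.h>

/* Hypothetical example type; would normally be declared in a shared header. */
struct counter { long value; };

/* ---- accessors.c: tiny helpers, opaque to other files without LTO ---- */
long counter_get(const struct counter *c) { return c->value; }
void counter_set(struct counter *c, long v) { c->value = v; }

/* ---- main.c: hot loop making two calls per iteration ----
 * When the accessors live in another file and LTO is off, each iteration
 * has to perform real calls and go through memory for c.value.
 * When they can be inlined, c.value can simply stay in a register. */
long sum_to(long n) {
    assert(n >= 0);
    struct counter c = { 0 };
    for (long i = 1; i <= n; i++)
        counter_set(&c, counter_get(&c) + i);
    return counter_get(&c);
}
```

Calling sum_to(100) adds up 1 through 100 either way; only the number of instructions executed to get there differs between the two builds.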

For benchmarks, putting the work in a separate file from a repeat loop is a good way of stopping compilers from defeating the benchmark by optimizing across repeat-loop iterations. (e.g. hoisting work out of loops instead of re-computing something every time.)

Unless you use LTO, in which case the compiler can see across files again and may defeat the benchmark that way. (Or maybe there's another reason with dhrystone, IDK.)
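To make the repeat-loop point concrete, here is a small sketch with hypothetical names (checksum, run_benchmark); it is not how Dhrystone itself is structured. It is again a single listing so it compiles as is, with comments marking where the file boundary would be.

```c
#include <assert.h>
#include <stddef.h>

/* ---- work.c: the code being measured ---- */
long checksum(const int *data, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* ---- bench.c: the repeat loop ----
 * checksum() gets the same inputs on every iteration. If the compiler can
 * see its body (same file, or separate files plus -flto), it is allowed to
 * compute the result once and hoist it out of the loop, so the benchmark
 * no longer measures `repeats` runs of the work. With the definition in
 * another file and no LTO, it normally has to make the call every time. */
long run_benchmark(const int *data, size_t len, long repeats) {
    assert(repeats >= 0);
    long total = 0;
    for (long r = 0; r < repeats; r++)
        total += checksum(data, len);
    return total;
}
```

Whether a given compiler version actually does the hoisting depends on its inlining heuristics, so check the asm before trusting either result.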