c assembly gcc clang compiler-explorer

Examples of the '-falign-loops' optimisation occurring?


One of the optimisations gcc performs when optimising is controlled by -falign-loops.

A vague description is provided here: https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/data-options/falign-loops-qalign-loops.html

It is listed as one of the optimisations occurring with the -O2 flag here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

I have been unable to see it in action with any piece of code I have tried in Compiler Explorer. Does anyone know how the flag works, and perhaps have some explicit examples?

Thanks


Solution

  • When I build this example function:

    unsigned summer( unsigned accum, int * ptr, unsigned N )
    {
        for(unsigned i = 0; i < N; ++i )
        {
            accum += *ptr++;
        }
        return accum;
    }
    

    With Compiler Explorer's ARM gcc 8.5 (linux) and CFLAGS="-O3 -Wall -Wextra -mcpu=cortex-m4 -falign-loops=4", I don't see evidence of loop alignment at first:

    summer(unsigned int, int*, unsigned int):
        cbz     r2, .L9
        push    {r4}
        movs    r3, #0
    .L3:
        ldr     r4, [r1], #4
        adds    r3, r3, #1
        cmp     r2, r3
        add     r0, r0, r4
        bne     .L3
        pop     {r4}
        bx      lr
    .L9:
        bx      lr
    

    After unchecking "Filter->Directives" I see a lot more. Here is just the function, with unrelated directives removed by hand:

    summer(unsigned int, int*, unsigned int):
        cbz     r2, .L9
        push    {r4}
        movs    r3, #0
    .LVL1:
        .p2align 2      @ align the next instruction to 2^2 = 4 bytes (.p2align takes the power of two)
    .L3:
        ldr     r4, [r1], #4
        adds    r3, r3, #1
        cmp     r2, r3
        add     r0, r0, r4
        bne     .L3
        pop     {r4}
        bx      lr
    .L9:
        bx      lr
    

    But we don't really see the effect of .p2align yet. After re-enabling "Filter->Directives" and also checking "Output->Compile to binary object", we see the extra NOP that -falign-loops=4 inserts just before the loop:

    summer(unsigned int, int*, unsigned int):
        cbz r2, 18 <summer(unsigned int, int*, unsigned int)+0x18>
        push    {r4}
        movs    r3, #0
        nop
        ldr.w   r4, [r1], #4
        adds    r3, #1
        cmp r2, r3
        add r0, r4
        bne.n   8 <summer(unsigned int, int*, unsigned int)+0x8>
        pop {r4}
        bx  lr
        bx  lr
        nop
    

    Now that we see what it does, could we improve it? Perhaps some cores would prefer that we fold "movs r3, #0" and "nop" into a single 32-bit wide instruction, "movs.w r3, #0". As it stands, the NOP costs one extra instruction per function call, which is cheaper than paying the misaligned 32-bit instruction penalty (the "ldr.w" at the top of the loop) on every loop iteration.
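
To see why exactly one NOP appears, here is a minimal C sketch of the padding arithmetic behind `.p2align`. It is an illustration only, not how the assembler is implemented. The helper names `p2align_padding` and `code_offset` are hypothetical. The instruction sizes are an assumption read off the disassembly: the `bne.n` target of 0x8 implies that `cbz`, `push` and `movs` are each 2-byte Thumb encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of padding needed so that `offset` becomes a multiple of 2^power,
 * which is what `.p2align power` asks for.
 * Assumes power < 32 (hypothetical helper, for illustration only). */
static uint32_t p2align_padding(uint32_t offset, unsigned power)
{
    uint32_t mask = (UINT32_C(1) << power) - 1u;  /* power = 2 -> mask = 3 */
    return (0u - offset) & mask;                  /* distance to next boundary */
}

/* Sum of the instruction sizes (in bytes) that precede the loop label. */
static uint32_t code_offset(const uint8_t *sizes, size_t count)
{
    uint32_t offset = 0;
    for (size_t i = 0; i < count; ++i)
        offset += sizes[i];
    return offset;
}
```

With the assumed sizes {2, 2, 2} for `cbz`, `push` and `movs`, `code_offset` gives 6 and `p2align_padding(6, 2)` gives 2, which is one 16-bit Thumb NOP. That matches the loop starting at offset 0x8 in the binary output.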