One of the optimisations GCC applies is -falign-loops.
The only description I could find is a vague one in Intel's compiler documentation: https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/data-options/falign-loops-qalign-loops.html
GCC lists it among the optimisations enabled by the -O2 flag: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
I have not been able to see it take effect in any piece of code I have tried on Compiler Explorer. Does anyone know how the flag works, ideally with some explicit examples?
Thanks
When I build this example function:
unsigned summer( unsigned accum, int * ptr, unsigned N )
{
    for(unsigned i = 0; i < N; ++i )
    {
        accum += *ptr++;
    }
    return accum;
}
With Compiler Explorer's ARM gcc 8.5 (linux) and CFLAGS="-O3 -Wall -Wextra -mcpu=cortex-m4 -falign-loops=4", at first I don't see evidence of loop alignment:
summer(unsigned int, int*, unsigned int):
cbz r2, .L9
push {r4}
movs r3, #0
.L3:
ldr r4, [r1], #4
adds r3, r3, #1
cmp r2, r3
add r0, r0, r4
bne .L3
pop {r4}
bx lr
.L9:
bx lr
After unchecking "Filter->Directives" I see a lot more. Here is just the function, with unrelated directives removed by hand:
summer(unsigned int, int*, unsigned int):
cbz r2, .L9
push {r4}
movs r3, #0
.LVL1:
.p2align 2 @ align the next instruction to 2^2 = 4 bytes (.p2align takes a power of two)
.L3:
ldr r4, [r1], #4
adds r3, r3, #1
cmp r2, r3
add r0, r0, r4
bne .L3
pop {r4}
bx lr
.L9:
bx lr
But we don't really see the effect of .p2align yet. Re-enabling "Filter->Directives" and also checking "Output->Compile to binary object", we see the NOP that -falign-loops=4 inserts:
summer(unsigned int, int*, unsigned int):
cbz r2, 18 <summer(unsigned int, int*, unsigned int)+0x18>
push {r4}
movs r3, #0
nop
ldr.w r4, [r1], #4
adds r3, #1
cmp r2, r3
add r0, r4
bne.n 8 <summer(unsigned int, int*, unsigned int)+0x8>
pop {r4}
bx lr
bx lr
nop
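The padding in the listing above can be checked by hand. Here is a minimal sketch in C of the arithmetic the assembler does for .p2align. The function names pad_for_p2align and aligned_loop_offset are made up for illustration, and the 2-byte instruction sizes are my reading of the disassembly (cbz, push and movs are all 16-bit Thumb encodings here), not something the toolchain reports directly.

```c
#include <stddef.h>

/* Bytes of padding needed to raise `offset` to the next multiple of
   2^log2_align, which is what `.p2align log2_align` requests.
   (Hypothetical helper, for illustration only.) */
static unsigned pad_for_p2align(unsigned offset, unsigned log2_align)
{
    unsigned align = 1u << log2_align;
    return (align - (offset & (align - 1u))) & (align - 1u);
}

/* Offset of the loop label after padding, given the encoded sizes (in
   bytes) of the instructions that precede it in the function. */
static unsigned aligned_loop_offset(const unsigned *sizes, size_t count,
                                    unsigned log2_align)
{
    unsigned offset = 0;
    for (size_t i = 0; i < count; ++i)
        offset += sizes[i];
    return offset + pad_for_p2align(offset, log2_align);
}
```

With sizes {2, 2, 2} for cbz, push and movs, the loop label would sit at offset 6. pad_for_p2align(6, 2) gives 2 bytes, which is exactly one 16-bit nop, and aligned_loop_offset puts the loop body at offset 8, which agrees with the "bne.n 8" branch target in the disassembly.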
Now that we see what it is, could we improve it? Perhaps some cores would prefer that we combine "movs r3, #0" and "nop" into a single 32-bit wide instruction, "movs.w r3, #0". As it stands, the NOP costs us once per function call, whereas a misaligned 32-bit instruction at the top of the loop would incur its penalty on every loop iteration.