Does arm-none-eabi-gcc produce slower code than Keil uVision

I have a simple blinking led program running on STM32f103C8 (without initialization boilerplate):

void soft_delay(void) {
    for (volatile uint32_t i=0; i<2000000; ++i) { }
}

  uint32_t iters = 0;
  while (1)
  {
    LL_GPIO_TogglePin(LED_GPIO_Port, LED_Pin);
    soft_delay();
    ++iters;
  }

It was compiled with both Keil uVision v.5 (default compiler) and CLion using arm-none-eabi-gcc compiler. The surprise is that arm-none-eabi-gcc program runs 50% slower in Release mode (-O2 -flto) and 100% slower in Debug mode.

I suspect 3 reasons:

Keil over-optimization (unlikely, because the code is very simple)
arm-none-eabi-gcc under-optimization due to wrong compiler flags (I use CLion Embedded plugins` CMakeLists.txt)
A bug in the initialization so that chip has lower clock frequency with arm-none-eabi-gcc (to be investigated)

I have not yet dived into the jungles of optimization and disassembling, I hope that there are many experienced embedded developers who already encountered this issue and have the answer.

UPDATE 1

Playing around with different optimization levels of Keil ArmCC, I see how it affects the generated code. And it affects drastically, especially execution time. Here are the benchmarks and disassembly of soft_delay() function for each optimization level (RAM and Flash amounts include initialization code).

-O0: RAM: 1032, Flash: 1444, Execution Time (20 iterations): 18.7 sec

soft_delay PROC
        PUSH     {r3,lr}
        MOVS     r0,#0
        STR      r0,[sp,#0]
        B        |L6.14|
|L6.8|
        LDR      r0,[sp,#0]
        ADDS     r0,r0,#1
        STR      r0,[sp,#0]
|L6.14|
        LDR      r1,|L6.24|
        LDR      r0,[sp,#0]
        CMP      r0,r1
        BCC      |L6.8|
        POP      {r3,pc}
        ENDP

-O1: RAM: 1032, Flash: 1216, Execution Time (20 iterations): 13.3 sec

soft_delay PROC
        PUSH     {r3,lr}
        MOVS     r0,#0
        STR      r0,[sp,#0]
        LDR      r0,|L6.24|
        B        |L6.16|
|L6.10|
        LDR      r1,[sp,#0]
        ADDS     r1,r1,#1
        STR      r1,[sp,#0]
|L6.16|
        LDR      r1,[sp,#0]
        CMP      r1,r0
        BCC      |L6.10|
        POP      {r3,pc}
        ENDP

-O2 -Otime: RAM: 1032, Flash: 1136, Execution Time (20 iterations): 9.8 sec

soft_delay PROC
        SUB      sp,sp,#4
        MOVS     r0,#0
        STR      r0,[sp,#0]
        LDR      r0,|L4.24|
|L4.8|
        LDR      r1,[sp,#0]
        ADDS     r1,r1,#1
        STR      r1,[sp,#0]
        CMP      r1,r0
        BCC      |L4.8|
        ADD      sp,sp,#4
        BX       lr
        ENDP

-O3: RAM: 1032, Flash: 1176, Execution Time (20 iterations): 9.9 sec

soft_delay PROC
        PUSH     {r3,lr}
        MOVS     r0,#0
        STR      r0,[sp,#0]
        LDR      r0,|L5.20|
|L5.8|
        LDR      r1,[sp,#0]
        ADDS     r1,r1,#1
        STR      r1,[sp,#0]
        CMP      r1,r0
        BCC      |L5.8|
        POP      {r3,pc}
        ENDP

TODO: benchmarking and disassembly for arm-none-eabi-gcc.

Solution

There are many factors at play here. Certainly if you have an optimizing compiler and you compare optimized vs not DEPENDING ON THE CODE you can see a large difference in execution speed. Using the volatile in the tiny loop here actually masks some of that, in both cases it should be read/written to memory every loop.

But the calling code the loop variable unoptimized would touch ram two or three times in that loop, optimized ideally would be in a register the whole time, making for a dramatic difference in execution performance even with zero wait state ram.

The toggle pin code is relatively large (talking to the peripheral directly would be less code), depending on whether that library was compiled separately with different options or at the same time with the same options makes a big difference with respect to performance.

Add that this is an mcu and running off of a flash which with the age of this part the flash might at best be half the clock rate of the cpu and worst a number of wait states and I don't remember off hand if ST had the caching in front of it at that time. so every instruction you add can add a clock, so just the loop variable alone can dramatically change the timing.

Being a high performance pipelined core I have demonstrated here and elsewhere that alignment can (not always) play a role, so if in one case the exact same machine code links to address 0x100 in one case and 0x102 in another it is possible that exact same machine code takes extra or fewer clocks to execute based on the nature of the pre-fetcher in the design, or the flash implementation, cache if any implementation, branch predictor, etc.

And then the biggest problem is how did you time this, it is not uncommon for there to be error in not using a clock correctly such that the clock/timing code itself varies and is creating some of the difference. Plus are there background things going on, interrupts/multitasking.

Michal Abrash wrote a wonderful book called The Zen of Assembly Language, you can get it for free in ePub or perhaps pdf form on GitHub. the 8088 was obsolete when the book was released but if you focus on that then you have missed the point, I bought it when it came out and have used what I learned on nearly a daily basis.

gcc is not a high performance compiler it is more of a general purpose compiler built Unix style where you can have different language front ends and different target backends. When I was in the position you are now trying to first understand these things I sampled many compilers for the same arm target and same C code and the results were vast. Later I wrote an instruction set simulator so I could count instructions and memory accesses to compare gnu vs llvm, as the latter has more optimization opportunities than gnu but for execution tests of code gcc was faster sometimes but not slower. That ended up being more of a long weekend having fun than something I used to analyze the differences.

It is easier to start with small-ish code like this and disassemble the two. Understand that fewer instructions doesn't mean faster, one distant memory access on a dram based system can take hundreds of clock cycles, that might be replaced with another solution that takes a handful/dozen of linearly fetched instructions to end up with the same result (do some math vs look up something in a rarely sampled table) and depending on the situation the dozen instructions execute much faster. at the same time the table solution can be much faster. it depends.

Examination of the disassembly often leads to incorrect conclusions (read abrash, not just that book, everything) as first off folks think less instructions means faster code. rearranging instructions in a pipelined processor can improve performance if you move an instruction into a time period that would have otherwise been wasted clocks. incrementing a register not related to a memory access in front of the memory access instead of after in a non-superscaler processor.

Ahh, back to a comment. This was years ago and competing compilers were more of a thing most folks just wrap their gui around gnu and the ide/gui is the product not the compiler. But there was the arm compiler itself, before the rvct tools, ads and I forget the other, those were "better" than gcc. I forget the names of the others but there was one that produced significantly faster code, granted this was Dhrystone so you will also find that they may tune optimizers for Dhrystone just to play benchmark games. Now that I can see how easy it is to manipulate benchmarks I consider them to in general be bu33zzit, can't be trusted. Kiel used to be a multi-target tool for mcus and similar, but then was purchased by arm and I thought at the time they were dropping all other targets, but have not checked in a while. I might have tried them once to get access to a free/demo version of rvct as when I was working at one job we had a budget to buy multi-thousand dollar tools, but that didn't include rvct (although I was on phone calls with the formerly Allant folks who were part of a purchase that became the rvct tools) which I was eager to try once they had finished the development/integration of that product, by then didn't have a budget for that and later couldn't afford or wasn't interested in buying even kiels tools, much less arms. Their early demos of rvct created an encrypted/obfuscated binary that was not arm machine code it only ran on their simulator so you couldn't use it to eval performance or to compare it to others, don't think they were willing to give us an un-obfuscated version and we weren't willing to reverse engineer it. now it is easier just to use gcc or clang and hand optimize where NEEDED. Likewise with experience can write C code that optimizes better based on experience examining compiler output.

You have to know the hardware, particularly in this case where you take processor IP and most of the chip is not related to the processor IP and most of the performance is not related to the processor IP (pretty much true for a lot of platforms today in particular your server/desktop/laptop). The Gameboy Advance for example used a lot of 16 bit buses instead of 32, thumb tools were barely being integrated, but thumb mode while counting instructions/or bytes was like 10% more code at the time, executed significantly faster on that chip. On other implementations both arm architecture and chip design thumb may have performance penalties, or not.

ST in general with the cortex-m products tends to put a cache in front of the flash, sometimes they document it and provide enable/disable control sometimes not so it can be difficult at best to get a real performance value as the typical thing is to run the code under test many times in a loop so you can get a better time measurement. other vendors don't necessarily do this and it is much easier to see the flash wait states and get a real, worst case, timing value that you can use to validate your design. caches in general as well as pipelines make it difficult at best to get good, repeatable, reliable numbers to validate your design. So for example you sometimes cannot do the alignment trick to mess with performance of the same machine code on an st but on say a ti with the same core you can. st might not give you the icache in a cortex-m7 where another vendor might since st has already covered that. Even within one brand name though don't expect the results of one chip/family to translate to another chip/family even if they use the same core. Also look at subtle comments in the arm documentation as to whether some cores offer a fetch size as an advertised option, single or multi-cycle multiply, etc. and I'll tell you that there are other compile time options for the core that are not shown in the technical reference manual that can affect performance so don't assume that all cortex-m3s are the same even if they are the same revision from arm. The chip vendor has the source so they can go even further and modify it, or for example a register bank to be implemented by the consumer they might change it from no protection to parity to ecc which might affect performance while retaining all of arms original code as is. When you look at an avr or pic though or even an msp430, while I cant prove it those designs appear more static not tiny vs Xmega vs regular old avr as there are definite differences there, but one tiny to another.

Your assumptions are a good start, there really isn't such a thing as over-optimization, more of a thing of missed optimizations, but there may be other factors you are not seeing in your assumptions that may or may not be see in a disassembly. there are obvious things that we would expect like one of the loop variables to be register based vs memory based. alignment, I wouldn't expect the clock settings to change if you used the same code, but a different set of experiments using timers or a scope you can measure the clock settings to see if they were configured the same. background tasks, interrupts and dumb luck as to how/when they hit the test. But bottom line, sometimes it is as simple as a missed optimization or subtle differences in how one compiler generates code to another, it is as often not those things and more of a system issue, memory speed, peripheral speed, caches or their architecture, how the various busses in the design operate, etc. For some of these cortex-ms (and many other processors) you can exploit their bus behavior to show a performance difference in something that the average person wouldn't expect to see.