Tags: gcc, arm, compiler-optimization, cortex-m, integer-division

Why is ARM gcc calling __udivsi3 when dividing by a constant?


I'm using the latest available version of ARM-packaged GCC:

arm-none-eabi-gcc (GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release) Copyright (C) 2020 Free Software Foundation, Inc.

When I compile this code using "-mcpu=cortex-m0 -mthumb -Ofast":

int main(void) {
    uint16_t num = (uint16_t) ADC1->DR;
    ADC1->DR = num / 7;
}

I would expect that the division would be accomplished by a multiplication and a shift, but instead this code is being generated:

 08000b5c <main>:
 8000b5c: b510 push {r4, lr}
 8000b5e: 4c05 ldr r4, [pc, #20] ; (8000b74 <main+0x18>)
 8000b60: 2107 movs r1, #7
 8000b62: 6c20 ldr r0, [r4, #64] ; 0x40
 8000b64: b280 uxth r0, r0
 8000b66: f7ff facf bl 8000108 <__udivsi3>
 8000b6a: b280 uxth r0, r0
 8000b6c: 6420 str r0, [r4, #64] ; 0x40
 8000b6e: 2000 movs r0, #0
 8000b70: bd10 pop {r4, pc}
 8000b72: 46c0 nop ; (mov r8, r8)
 8000b74: 40012400 .word 0x40012400

Using __udivsi3 instead of multiply and shift is terribly inefficient. Am I using the wrong flags, or missing something else, or is this a GCC bug?


Solution

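  • The Cortex-M0 lacks instructions to perform a 32x32->64-bit multiply, which is what gcc's usual divide-by-constant sequence relies on. Because num is an unsigned 16-bit quantity, a 32x32->32 multiply would suffice: multiplying by 9363 and shifting right 16 yields the correct quotient for every value up to 13109 (the first wrong result is 13110, which gives 1873 instead of 1872), and adding num to that shifted product and then shifting right 3 yields the correct quotient over the whole 16-bit range. Likely because a uint16_t is promoted to int before the multiply, however, gcc does not include such optimizations.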
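A minimal sketch of division by 7 using only 32-bit multiplies, shifts and adds, which the Cortex-M0 does have. The function names (div7_u16, div7_small, div7_self_test) are hypothetical and not from the question. The bare multiply by 9363 and shift by 16 is exact only for inputs up to 13109; covering every 16-bit input takes one extra add and shift, equivalent to multiplying by 74899 (65536 + 9363) and shifting right 19 without overflowing 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Divide a 16-bit value by 7 using only a 32x32->32 multiply.
   Equivalent to (n * 74899) >> 19, with the 17-bit multiplier split as
   65536 + 9363 so that every intermediate value fits in 32 bits. */
static inline uint16_t div7_u16(uint16_t n)
{
    uint32_t t = ((uint32_t)n * 9363u) >> 16;  /* product is at most 613,604,205 */
    return (uint16_t)(((uint32_t)n + t) >> 3);
}

/* Simpler form: exact only for n <= 13109 (n = 13110 gives 1873, not 1872). */
static inline uint16_t div7_small(uint16_t n)
{
    return (uint16_t)(((uint32_t)n * 9363u) >> 16);
}

/* Host-side check (an assumption about how one might validate this):
   compare both forms against real division over their valid ranges. */
static void div7_self_test(void)
{
    for (uint32_t n = 0; n <= 0xFFFFu; n++) {
        assert((uint32_t)div7_u16((uint16_t)n) == n / 7u);
        if (n <= 13109u)
            assert((uint32_t)div7_small((uint16_t)n) == n / 7u);
    }
}
```

In the question's code this would read ADC1->DR = div7_u16(num); whether gcc then emits a single muls for it is worth confirming in the disassembly.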

    From what I've observed, gcc does a generally poor job of optimizing for the Cortex-M0, failing to employ some straightforward optimizations which would be appropriate for that platform, but sometimes employing "optimizations" which aren't. Given something like

    void test1(uint8_t *p)
    {
        for (int i=0; i<32; i++)
            p[i] = (p[i]*9363) >> 16; // Divide by 7
    }
    

    gcc happens to generate okay code for the Cortex-M0 at -O2, but if the multiplication were replaced with an addition the compiler would generate code which reloads the constant 9363 on every iteration of the loop. When using addition, even if the code were changed to:

    void test2(uint16_t *p)
    {
        register unsigned u9363 = 9363;
        for (int i=0; i<32; i++)
            p[i] = (p[i]+u9363) >> 16;
    }
    

    gcc would still bring the load of the constant into the loop. Sometimes gcc's optimizations may also have unexpected behavioral consequences. For example, one might expect that on a platform like a Cortex-M0, invoking something like:

    unsigned short test(register unsigned short *p)
    {
        register unsigned short temp = *p;
        return temp - (temp >> 15);
    }    
    

    while an interrupt changes the contents of *p might yield behavior consistent with either the old value or the new value. The Standard wouldn't require such treatment, but most implementations intended to be suitable for embedded programming tasks offer stronger guarantees than the Standard requires. If either the old or the new value would be equally acceptable, letting the compiler use whichever is more convenient may allow more efficient code than using volatile. As it happens, however, the "optimized" code from gcc will replace the two uses of temp with separate loads of *p, so the result can mix the two values: if *p changes from 0xFFFF to 0 between the loads, the function returns 0xFFFF, which neither the old value (0xFFFE) nor the new value (0) would produce.
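If a consistent snapshot matters more than the last bit of speed, one common idiom is to make only the single read volatile-qualified. The sketch below assumes that approach; the name test_single_load is hypothetical, and this is a variant of the function above rather than anything gcc or the question prescribes.

```c
/* Variant of test() that reads *p exactly once. */
static inline unsigned short test_single_load(unsigned short *p)
{
    /* Volatile-qualified access: the compiler must perform this load
       exactly once and may not re-read *p for later uses of temp. */
    unsigned short temp = *(volatile unsigned short *)p;

    /* Both uses below come from the same snapshot of *p. */
    return (unsigned short)(temp - (temp >> 15));
}
```

Only the one load is constrained here, so the rest of the function can still be optimized freely.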

    If you're using gcc with the Cortex-M0 and are at all concerned about performance or the possibility of "astonishing" behaviors, get in the habit of inspecting the compiler's output. For some kinds of loop, it may even be worth trying -O0: at that setting, code that makes suitable use of the register keyword can sometimes outperform the same code compiled with -O2.