
Fixed point math with ARM Cortex-M4 and gcc compiler


I'm using a Freescale Kinetis K60 with the CodeWarrior IDE (which I believe uses GCC as its compiler).

I want to multiply two 32-bit numbers (which gives a 64-bit result) and retain only the upper 32 bits.

I think the correct assembly instruction for the ARM Cortex-M4 is the SMMUL instruction. I would prefer to access this instruction from C code rather than assembly. How do I do this?

I imagine the code would ideally be something like this:

int a,b,c;

a = 1073741824;   // 0x40000000 = 0.5 as a D0 fixed point number
b = 1073741824;   // 0x40000000 = 0.5 as a D0 fixed point number

c = ((long long)a*b) >> 31;  // 31 because there are two sign bits after the multiplication
                             // so I can throw away the most significant bit

When I try this in CodeWarrior, I get the correct result for c (536870912 = 0.25 as a D0 FP number). However, I don't see the SMMUL instruction anywhere, and the multiply takes 3 instructions (UMULL, MLA, and MLA; I don't understand why it is using an unsigned multiply, but that is another question). I have also tried a right shift of 32, since that might make more sense for the SMMUL instruction, but that doesn't change the generated code.


Solution

  • The problem with optimizing that code is that the compiler produces this:

    08000328 <mul_test01>:
     8000328:   f04f 5000   mov.w   r0, #536870912  ; 0x20000000
     800032c:   4770        bx  lr
     800032e:   bf00        nop
    

    Your code doesn't do anything at run time, so the optimizer can just compute the final answer at compile time.

    This assembly function:

    .thumb_func
    .globl mul_test02
    mul_test02:
        smull r2,r3,r0,r1
        mov r0,r3
        bx lr
    

    called with this:

    c = mul_test02(0x40000000,0x40000000);
    

    gives 0x10000000, the upper 32 bits of the 64-bit product 0x1000000000000000.

    UMULL gives the same result because you are using positive numbers: the operands and the result are all positive, so the signed/unsigned differences never come into play.
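
    To make the signed/unsigned point concrete, here is a minimal C sketch. The helper names mulhi_signed and mulhi_unsigned are hypothetical and only model what the upper word of SMULL and UMULL would hold. It assumes the compiler implements a right shift of a negative 64-bit value as an arithmetic shift, which gcc and clang do.

```c
#include <stdint.h>

/* Hypothetical helper: upper 32 bits of a signed 32x32 -> 64 multiply
 * (models the high word produced by SMULL). */
static int32_t mulhi_signed(int32_t a, int32_t b)
{
    /* Assumes >> on a negative int64_t is an arithmetic shift. */
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* Hypothetical helper: upper 32 bits of an unsigned 32x32 -> 64 multiply
 * (models the high word produced by UMULL). */
static uint32_t mulhi_unsigned(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```

    For 0x40000000 times 0x40000000 both helpers return 0x10000000. For the bit pattern 0xFFFFFFFF times 1 they differ: the signed version sees -1 and returns -1 (0xFFFFFFFF), while the unsigned version returns 0.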

    Hmm, well you got me on this one. I would read your code as telling the compiler to promote the multiply to 64 bits. smull takes two 32-bit operands and gives a 64-bit result, which is not what your code is asking for... but both gcc and clang used smull anyway. Even when I left it as an uncalled function, so the compiler could not know at compile time that the operands had no significant bits above the low 32, they still used smull.
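
    As an aside on the UMULL, MLA, MLA sequence reported in the question: that pattern is consistent with a generic 64-bit by 64-bit multiply built from 32-bit pieces, which is what a literal promotion to long long asks for. The sketch below is an assumption about what that sequence computes, not a disassembly of the CodeWarrior output, and the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of a 64x64 -> 64 multiply built from 32-bit parts. */
static uint64_t mul64_from_32bit_parts(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* Low halves: unsigned 32x32 -> 64 (the UMULL step). */
    uint64_t lo = (uint64_t)a_lo * b_lo;

    /* Cross terms only affect the upper word; each one is a 32-bit
     * multiply-accumulate (the two MLA steps). Overflow wraps, which is
     * correct because the result is truncated to 64 bits. */
    uint32_t hi = (uint32_t)(lo >> 32);
    hi += a_lo * b_hi;
    hi += a_hi * b_lo;

    return ((uint64_t)hi << 32) | (uint32_t)lo;
}
```

    The low halves are multiplied unsigned even for signed operands, because in two's complement the sign information lives entirely in the upper halves. That would explain the unsigned multiply the question asks about.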

    Perhaps the shift was the reason.

    Yup, that was it..

    int mul_test04 ( int a, int b )
    {
        int c;
        c = ((long long)a*b) >> 31; 
        return(c);
    }
    

    gives this with both gcc and clang (well, clang recycles r0 and r1 instead of using r2 and r3):

    08000340 <mul_test04>:
     8000340:   fb81 2300   smull   r2, r3, r1, r0
     8000344:   0fd0        lsrs    r0, r2, #31
     8000346:   ea40 0043   orr.w   r0, r0, r3, lsl #1
     800034a:   4770        bx  lr
    

    but this

    int mul_test04 ( int a, int b )
    {
        int c;
        c = ((long long)a*b); 
        return(c);
    }
    

    gives this

    gcc:

    08000340 <mul_test04>:
     8000340:   fb00 f001   mul.w   r0, r0, r1
     8000344:   4770        bx  lr
     8000346:   bf00        nop
    

    clang:

    0800048c <mul_test04>:
     800048c:   4348        muls    r0, r1
     800048e:   4770        bx  lr
    

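    So without the shift, the conversion back to int keeps only the lower 32 bits of the product, and a plain 32-bit mul is enough. With the shift, the compilers realize that you need the upper portion of the result. Because the operands are 32-bit values widened to 64 bits, their upper halves carry no extra information, which means a single smull can be used.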
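
    A small C sketch of why the unshifted version collapses to a plain mul: truncating the 64-bit product to 32 bits gives the same bits as a 32-bit multiply. The function names are hypothetical, and the conversion from uint32_t back to int32_t is assumed to wrap modulo 2^32, as it does with gcc and clang.

```c
#include <stdint.h>

/* Hypothetical: widen, multiply, then keep only the low 32 bits. */
static int32_t low_word_via_64(int32_t a, int32_t b)
{
    return (int32_t)(uint32_t)((int64_t)a * b);
}

/* Hypothetical: plain 32-bit multiply; unsigned arithmetic is used so
 * that wraparound is well defined. */
static int32_t low_word_via_32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a * (uint32_t)b);
}
```

    Both functions return the same value for every pair of inputs, so a compiler is free to implement the first with the instruction sequence of the second.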

    Now if you do this:

    int mul_test04 ( int a, int b )
    {
        int c;
        c = ((long long)a*b) >> 32; 
        return(c);
    }
    

    Both compilers get even smarter, in particular clang:

    0800048c <mul_test04>:
     800048c:   fb81 1000   smull   r1, r0, r1, r0
     8000490:   4770        bx  lr
    

    gcc:

    08000340 <mul_test04>:
     8000340:   fb81 0100   smull   r0, r1, r1, r0
     8000344:   4608        mov r0, r1
     8000346:   4770        bx  lr
    

    I can see 0x40000000 being treated as a fixed-point number, where you keep track of the binary point yourself and it sits at a fixed location. 0x20000000 would make sense as the answer. I can't yet decide whether that 31-bit shift works universally or just for this one case.
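
    On the question of whether the 31-bit shift works universally, here is a hedged sketch. It assumes the operands are in the format where 0x40000000 means 0.5 (so 0x80000000 means -1.0) and that right-shifting a negative 64-bit value is an arithmetic shift, as with gcc and clang. Under those assumptions the shifted product fits in 32 bits for every input pair except -1.0 times -1.0, which would need +1.0 and cannot be represented. The helper name q31_mul and the choice to saturate are mine, not part of the original code.

```c
#include <stdint.h>

/* Hypothetical helper: fixed-point multiply where 0x40000000 == 0.5. */
static int32_t q31_mul(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * b;   /* exact 64-bit product */

    /* Drop the duplicated sign bit. Assumes an arithmetic right shift
     * for negative values. */
    p >>= 31;

    /* Only reachable for INT32_MIN * INT32_MIN, i.e. (-1.0) * (-1.0):
     * clamp to the largest representable value instead of wrapping. */
    if (p > INT32_MAX) {
        p = INT32_MAX;
    }
    return (int32_t)p;
}
```

    With this, q31_mul(0x40000000, 0x40000000) returns 0x20000000 (0.25), matching the result above. Without the clamp, the single overflow case would wrap to the most negative value on typical compilers.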

    A complete example used for the above is here:

    https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/sample01

    I ran it on an stm32f4 to verify that it works and to confirm the results.

    EDIT:

    If you pass the parameters into the function instead of hardcoding them within the function:

    int myfun ( int a, int b )
    {
         return(a+b);
    }
    

    the compiler is forced to generate run-time code instead of computing the answer at compile time.

    Now if you call that function from another function with hardcoded numbers:

    ...
    c=myfun(0x1234,0x5678);
    ...
    

    In this calling function the compiler may choose to compute the answer and just place it there at compile time. If the myfun() function is global (not declared as static), the compiler doesn't know whether code in other files, linked later, will call it. So even if it replaces the call in this file with a precomputed answer, it still has to emit the actual function and leave it in the object file for other code to call, and you can still examine what the compiler/optimizer does with that C code. Unless you optimize the whole project across files (which llvm, for example, can do), external code calling this function will use the real function and not a compile-time computed answer.

    Both gcc and clang did what I am describing: they left run-time code for the function as a global function, but within the file they computed the answer at compile time and placed the hardcoded answer in the code instead of calling the function:

    int mul_test04 ( int a, int b )
    {
        int c;
        c = ((long long)a*b) >> 31;
        return(c);
    }
    

    in another function in the same file:

    hexstring(mul_test04(0x40000000,0x40000000),1);
    

    The function itself is implemented in the code:

    0800048c <mul_test04>:
     800048c:   fb81 1000   smull   r1, r0, r1, r0
     8000490:   0fc9        lsrs    r1, r1, #31
     8000492:   ea41 0040   orr.w   r0, r1, r0, lsl #1
     8000496:   4770        bx  lr
    

    but where it is called they have hardcoded the answer because they had all the information needed to do so:

     8000520:   f04f 5000   mov.w   r0, #536870912  ; 0x20000000
     8000524:   2101        movs    r1, #1
     8000526:   f7ff fe73   bl  8000210 <hexstring>
    

    If you don't want the hardcoded answer, you need to use a function that is not in the same optimization pass.
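
    A different technique that stays in one file, and not the one used in the example above, is to read the operands through volatile objects so the optimizer cannot treat them as known constants. A minimal sketch, with call_with_runtime_operands as a hypothetical caller and fixed-width types swapped in for int:

```c
#include <stdint.h>

static int32_t mul_test04(int32_t a, int32_t b)
{
    /* Same expression as in the answer above. */
    return (int32_t)(((int64_t)a * b) >> 31);
}

/* Hypothetical caller: volatile forces a real load of each operand, so
 * the multiply has to happen at run time even with optimization on. */
int32_t call_with_runtime_operands(void)
{
    volatile int32_t a = 0x40000000;
    volatile int32_t b = 0x40000000;
    return mul_test04(a, b);
}
```

    The compiler may still inline mul_test04 into the caller, so the instructions to look at may appear there rather than in a separate function body.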

    Manipulating the compiler and optimizer comes down to a lot of practice, and it is not an exact science, as compilers and optimizers are constantly evolving (for better or worse).

    Isolating a small bit of code in a function causes problems in another way. Larger functions are more likely to need a stack frame and to evict variables from registers to the stack as they go; smaller functions might not need to, and the optimizer may implement the code differently as a result. You test the code fragment one way to see what the compiler does, then use it in a larger function and don't get the result you wanted.

    If there is an exact instruction or sequence of instructions you want, implement it in assembler. If you are targeting specific instructions in a specific instruction set or processor, then avoid the game, avoid your code changing when you change computers, compilers, and so on, and just use assembler for that target. If needed, use ifdef or other conditional-compile options to build for other targets without the assembler.
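
    As a sketch of that last suggestion, the wrapper below uses GCC-style inline assembler for SMMUL when building for ARMv7E-M (the Cortex-M4 architecture) and plain C everywhere else. The name mul_hi32 is hypothetical, the guard relies on the __ARM_ARCH_7EM__ macro that gcc defines for that architecture, and the C fallback assumes an arithmetic right shift. I have not run the assembler path on a K60, so treat it as a starting point.

```c
#include <stdint.h>

/* Hypothetical wrapper: upper 32 bits of a signed 32x32 multiply. */
static inline int32_t mul_hi32(int32_t a, int32_t b)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7EM__)
    /* Target-specific path: ask for SMMUL explicitly. */
    int32_t hi;
    __asm__ ("smmul %0, %1, %2" : "=r" (hi) : "r" (a), "r" (b));
    return hi;
#else
    /* Portable path for other targets; the compiler picks the code. */
    return (int32_t)(((int64_t)a * b) >> 32);
#endif
}
```

    Note that this keeps bits 63..32 of the product, so 0.5 times 0.5 in the question's format comes back as 0x10000000; a left shift by one is needed to match the 31-bit-shift result, at the cost of the lowest bit.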