Search code examples
cfloating-pointx86-64floating-accuracyicc

Why doesn't the same generated assembler code lead to the same output?


Sample code (t0.c):

#include <stdio.h>

float f(float a, float b, float c) __attribute__((noinline));
float f(float a, float b, float c)
{
    return a * c + b * c;
}

int main(void)
{
    void* p = V;
    printf("%a\n", f(4476.0f, 20439.0f, 4915.0f));
    return 0;
}

Invocation & execution (via godbolt.org):

# icc 2021.1.2 on Linux on x86-64
$ icc t0.c -fp-model=fast -O3 -DV=f
0x1.d32322p+26
$ icc t0.c -fp-model=fast -O3 -DV=0
0x1.d32324p+26

Generated assembler code is the same: https://godbolt.org/z/osra5jfYY.

Why doesn't the same generated assembler code lead to the same output?

Why does void* p = f; matter?


Solution

  • Godbolt shows you the assembly emitted by running the compiler with -S. But in this case, that's not the code that actually gets run, because further optimizations can be done at link time.

    Try checking the "Compile to binary" box instead (https://godbolt.org/z/ETznv9qP4), which will actually compile and link the binary and then disassemble it. We see that in your -DV=f version, the code for f is:

     addss  xmm0,xmm1
     mulss  xmm0,xmm2
     ret 
    

    just as before. But with -DV=0, we have:

     movss  xmm0,DWORD PTR [rip+0x2d88]
     ret
    

    So f has been converted to a function which simply returns a constant loaded from memory. At link time, the compiler was able to see that f was only ever called with a particular set of constant arguments, and so it could perform interprocedural constant propagation and have f merely return the precomputed result.

    Having an additional reference to f evidently defeats this. Probably the compiler or linker sees that f had its address taken, and didn't notice that nothing was ever done with the address. So it assumes that f might be called elsewhere in the program, and therefore it has to emit code that would give the correct result for arbitrary arguments.

    As to why the results are different: The precomputation is done strictly, evaluating both a*c and b*c as float and then adding them. So its result of 122457232 is the "right" one by the rules of C, and it is also what you get when compiling with -O0 or -fp-model=strict. The runtime version has been optimized to (a+b)*c, which is actually more accurate because it avoids an extra rounding; it yields 122457224, which is closer to the exact value of 122457225.