c performance visual-c++ inline-assembly fpu

MSVC Inline Assembly: Freeing FPU registers for performance

While experimenting with the x87 FPU through MSVC's inline assembly, I got a little confused about whether freeing FPU registers improves performance.

For example:

#include <stdio.h>

double fpu_add(register double x, register double y) {
    double res = 0.0;

    __asm {
        fld x
        fld y
        fadd
        fstp res
    }

    return res;
}

int main(void) {
    double x = fpu_add(5.0, 2.0);
    (void) printf("x = %f\n", x);
    
    return 0;
}

When do I have to ffree the FPU registers in Inline Assembly?

In that example, would performance be better if I decided to ffree the st(1) register?

Also, is fstp shorthand for the instructions below?

__asm {
    fst res
    ffree st(0)
}

NOTE: I know FPU instructions are a bit old nowadays, but I am treating them as another option alongside SSE.

Solution

The ffree instruction marks any slot of the x87 fp stack as free without changing the stack pointer. So ffree st(0) does NOT pop the stack; it just marks the top-of-stack value as free (empty), so any following instruction that tries to read it raises an x87 invalid-operation (stack underflow) exception, or produces a NaN when that exception is masked, as it is by default.

To actually pop the stack you need both ffree st(0) and fincstp (to increment the stack pointer). Or better, use fstp st(0) to do both with a single cheap instruction, or fstp st(1) to keep the top-of-stack value and discard the old st(1). That also answers the last question: fstp res behaves like fst res followed by a pop, not like fst res plus ffree st(0) alone.
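The difference between marking a slot free and popping can be sketched with a toy model in plain C. This is only a simulation of the TOP field and the per-register "empty" tags, not real hardware access, and every name in it (X87Model, model_fld, model_ffree, and so on) is made up for illustration. Stack overflow and underflow checks are left out.

```c
#include <stdbool.h>

/* Toy model of the x87 register stack. NOT real hardware access;
   all names here are hypothetical and exist only for illustration. */
typedef struct {
    unsigned top;      /* TOP field of the status word, 0..7 */
    bool     used[8];  /* false = tagged empty, true = holds a value */
    double   reg[8];   /* physical registers R0..R7 */
} X87Model;

/* Map the stack-relative name st(i) to a physical register index. */
unsigned model_phys(const X87Model *f, unsigned i) {
    return (f->top + i) & 7u;
}

/* Count the slots that currently hold a value. */
int model_depth(const X87Model *f) {
    int n = 0;
    for (int i = 0; i < 8; i++) n += f->used[i] ? 1 : 0;
    return n;
}

/* fld: decrement TOP, then write the new st(0). */
void model_fld(X87Model *f, double v) {
    f->top = (f->top - 1u) & 7u;
    f->reg[f->top] = v;
    f->used[f->top] = true;
}

/* ffree st(i): tag the slot empty; TOP does not move. */
void model_ffree(X87Model *f, unsigned i) {
    f->used[model_phys(f, i)] = false;
}

/* fincstp: increment TOP; tags and register contents are untouched. */
void model_fincstp(X87Model *f) {
    f->top = (f->top + 1u) & 7u;
}

/* fstp st(i): copy st(0) to st(i), then pop
   (tag the old st(0) empty and increment TOP). */
void model_fstp_st(X87Model *f, unsigned i) {
    unsigned dst = model_phys(f, i);
    f->reg[dst] = f->reg[f->top];
    f->used[dst] = true;
    f->used[f->top] = false;   /* for i == 0 this empties the slot again */
    f->top = (f->top + 1u) & 7u;
}
```

In this model, model_ffree(&f, 0) after a single model_fld leaves TOP where it was with no live slots, and only the following model_fincstp restores TOP. model_fstp_st(&f, 0) does both steps at once.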

But it's usually even better, easier, and faster to use the p-suffixed versions of other instructions. In your case, you probably want

__asm {
    fld x     // push x on the stack
    fld y     // push y on the stack
    faddp     // add st(0) into st(1), then pop: the sum is the new tos
    fstp res  // pop and store tos
}

This ends up pushing and popping two values, leaving the fp stack in the same state as it was before. Leaving values on the fp stack is likely to cause problems with other fp code if the compiler is generating x87 fp code, so it should be avoided.

Or even better, use a memory-source fadd to save an instruction, if you're optimizing for CPUs where that's not slower. (Check Agner Fog's microarch PDF and instruction tables for P5 Pentium and newer: it seems to be fine, at least break-even, and it saves a uop on more modern CPUs like Core2 that can micro-fuse memory source operands.)

    __asm {
        fld x     // push x on the stack
        fadd y    // ST0 += y
        fstp res  // pop and store tos
    }

But MSVC inline asm is inherently slow for wrapping a single instruction like fadd: it forces inputs to be in memory, even if the compiler had them available in registers before the asm statement. It also forces the result to be stored inside the asm block and then reloaded for the return statement, unless you use a hack like leaving the value in st(0) and falling off the end of the function without a return statement. (MSVC does actually support this even when inlining, but clang-cl / clang -fasm-blocks does not.)

GNU C inline asm could wrap a single fadd instruction, with constraints that ask for inputs in x87 registers and tell the compiler where the output is (in st(0)). But you'd still have to choose between fadd and faddp yourself, instead of letting the compiler pick based on whether it had values in registers or a value in memory. (https://stackoverflow.com/tags/inline-assembly/info)
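As a rough sketch of what that might look like, assuming GCC or a GCC-compatible compiler targeting x86 with the default AT&T syntax: the function names fpu_add_regs and fpu_add_mem are made up here, and the plain-C fallback for other targets exists only so the snippet still builds elsewhere. The "t" and "u" constraints name st(0) and st(1), and an input that the instruction pops has to be listed as a clobber.

```c
/* Sketch only: assumes GCC/Clang on x86 with AT&T syntax.
   fpu_add_regs and fpu_add_mem are hypothetical names. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))

/* Both inputs in x87 registers: faddp adds st(0) into st(1) and pops,
   so the sum ends up in the new st(0). */
double fpu_add_regs(double x, double y) {
    double res;
    __asm__("faddp"
            : "=t"(res)        /* output is left in st(0) */
            : "0"(x), "u"(y)   /* x shares st(0) with the output, y in st(1) */
            : "st(1)");        /* the input in st(1) is popped by faddp */
    return res;
}

/* One input in st(0), the other read from memory: no pop needed. */
double fpu_add_mem(double x, double y) {
    double res;
    __asm__("faddl %2"         /* st(0) += 64-bit double in memory */
            : "=t"(res)
            : "0"(x), "m"(y));
    return res;
}

#else
/* Fallback for non-x86 targets or other compilers: plain C. */
double fpu_add_regs(double x, double y) { return x + y; }
double fpu_add_mem(double x, double y)  { return x + y; }
#endif
```

The two variants show the choice described above: the author has to commit to a register source (faddp) or a memory source (faddl) when writing the asm statement, and the compiler only arranges for the operands to be where the constraints say.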

Compilers aren't terrible; they will make code at least this good from plain C source. Inline asm is generally not useful for performance unless you're writing a whole loop that's carefully tuned for a specific CPU, or the compiler does a poor job with something. (Look at the compiler's optimized asm output, e.g. on https://godbolt.org/)