performance assembly floating-point x86 x87

Assembly (i386): Math Coprocessor Stack

I was reading about the math coprocessor (Paul Carters PC Assembly Book) and its instructions to make floating point calculations (on ASM i386). Then I ran into the following code that is supposed to return the larger double of two given double values (C Calling Convention):

 1    %define d1 ebp+8
 2    %define d2 ebp+16
 3    global dmax
 4    
 5    segment .text
 6    dmax:
 7        enter 0,0
 8    
 9        fld qword [d2]
10        fld qword [d1] ;Now ST0 = d1 and ST1 = d2
11        fcomip st1 ;Compares ST0 with ST1 and pops ST0 out
12        jna short d2_bigger ;If not above (ST0<ST1)
13        fcomp st0 ;Get rid of ST0, which is actually d2 now (line 11)
14        fld qword [d1]
15        jmp short exit
16    d2_bigger:
17    exit:
18        leave
19        ret

There were two things I was thinking about changing on this code. First, I'd probably use FCOMI instead of FCOMIP on the comparison (line 11) to avoid 1 unnecessary coprocessor register pop. Doing this, if ST0=ST1 there would be no pop at all (since it is already in the top of the stack). The only reason I can see for not doing it would be that it would leave a unempty stack of the coprocessor registers. However, I think the only relevant value for C is ST0, which would be the return value of the double function. If another function pushed more than 8 float/double values to the coprocessor stack, wouldn't the values stored in the lowest members of the coprocessor stack (ST7) just be discarded? So is it really an issue to leave a function without clearing the coprocessor stack? => (READ EDIT)

The second thing I was thinking of changing is I'd probably not use the instruction FCOMP on line 13. I understand the reason it is there is to pop ST0 out of the stack to make ST1 reach the top. However, I think it's a bit of an overhead to make a whole comparison and setting the coprocessor flags just to pop the value. I looked for a instruction only for poping ST0 and apparently there is none. I thought it would be faster though to use FADDP ST0, ST0 (adds ST0 to ST0 and pops ST0 out) or FSTP ST0 (stores the value of ST0 to ST0 and pops ST0 out). They just look in my head like less work for the coprocessor.

I tried to test the speed of the 3 options (the one on the code above, FSTP ST0 and FADDP ST0, ST0) and after a few quick tests they all ran with very similar speeds. Kind of unaccurate to make a conclusion out of the values. Apparently the FADDP ST0,ST0 was a bit faster, followed by the FSTP ST0 and finally the FCOMP ST0. Is there a recommendation on which one to use? Or am I bothering too much about something that will have such a negligible effect on the overall speed?

I just questioned myself because since Assembly is about doing things the fastest way possible, maybe choosing between one of those approaches could have a benefit.

EDIT:

I was reading the Intel 64 and IA-32 Instruction Set Reference and apparently the coprocessor throws an exception if the stack overflows or underflows (Exception #IS). So using the stack and not emptying it (in this case, leaving only the ST0 so C will pop the return value of it) is not an option apparently.

Solution

Modern CPUs handle x87 register stack operations similar to the way they do register renaming for out-of-order execution requires. The P versions of x87 instructions execute with the same performance characteristics as the non-pop versions.

For everything you need to statically analyze latency, throughput, and total uops for this code on modern CPUs, see Agner Fog's microarch guide and instruction tables. Also, the x86 tag wiki for more links.

Oh, and definitely never use the ENTER instruction unless optimizing completely for size without caring about speed. It's insanely slow, even in the 0, 0 case.

Balancing the FP stack:

throws an exception if the stack overflows or underflows

FP exceptions are masked by default in most OSes. The even more important part of the behaviour is that ST0 holds garbage after an FLD that triggered an overflow. So your conclusion is right: following the ABI rules for the x87 stack are important: stack empty on function calls and empty or holding a float/double return value on return. (I'm not aware of any ABIs that do things differently, but you could have a calling convention that passed some FP args in x87 registers instead of the stack.)

C Calling Convention

There is no single calling convention for C across all x86 platforms. Many of the 32-bit ones pass double args on the stack, and return them in ST(0), like you're doing. So this is sort of ok except for the terminology.

In the usual 64-bit calling conventions, double args are passed in XMM registers (each arg in the low element of its own register). There are also 32-bit calling conventions that assume SSE2 and pass doubles this way as well. In that case:

; 64-bit Windows or non-Windows, or 32-bit-with-double-in-SSE2 calling convention:
global dmax
section .text
dmax:
    maxsd   xmm0, xmm1
    ret

Yep, there's an instruction for std::max(double,double). At this point, the function call has more overhead than the instruction, and using an asm function instead of letting a C compiler inline a C function to that instruction is a terrible idea. Especially in calling conventions (like System V, used by non-Windows) where all the XMM registers are call-clobbered, so a caller has to save/restore all double and float temporaries to memory across function calls.

If you had to write this with x87 instructions

fcomp st0 is not the best way to just pop the x87 stack. Use fstp st0 to do that.

It looks like you're assuming a P6 or newer CPU (since you use FCOMI/FCOMIP), so you might as well take advantage of FCMOVcc as well instead of using branches.

; 32-bit args-on-the-stack
section .text
; when one input is NaN, might return NaN or might return the other input
; This implements the C expression  (d1 < d2)
global dmax
dmax:
    fld     qword [esp+12]
    fld     qword [esp+4]     ; ST0 = d1 and ST1 = d2

    fucomi  st0, st1
    jp     handle_nan         ; optional.  MAXSD does this for free.  If you leave this out, I suggest using fcomi instead of fucomi, to raise #IA on NaN
    FCMOVb  st0, st1          ; st0 = (st0<st1) : st1 : st0.  (Also copies if unordered because CF=1 in that case, too.  But we don't know which operand was NaN.)

    ;; our return value is in st0, but st1 is still in use.
    fstp    st1               ; pop the stack while keeping st0.  (store it to st1, which becomes st0 after popping)
    ; alternative: ffree st1   ; I think this should work
    ret

handle_nan:
    faddp                     ; add both args together to get a NaN, whichever one was NaN to start with.
    ret

This has one very predictable branch (NaN probably never happens in real use, or else it always happens). The critical path is a round-trip through memory for arg-passing (~5 cycles), then fucomi(?) -> fcmov(2c) -> fstp st1 (1c). Those cycle counts are for Intel Haswell. Total latency = probably 5 + 5 (assuming 2c for FUCOMI).

Using FFREE st1 (if that works), would take the final fstp off the critical path. FXCHG (zero latency) and then popping st0 might also take that off the critical path. It would be possible for Intel to implement FSTP ST1 with zero latency like FXCHG (handled in the register-rename stage), but I don't think that's the case on any existing microarchitecture. (And unlikely to be a future feature, because x87 is mostly obsolete. IIRC, Intel Skylake slightly reduced the throughput of some x87 stuff vs. Haswell, by making more x87 instructions share the same execution port.)

Intel Haswell throughput: Agner Fog's spreadsheet doesn't list a latency for FUCOMI, but it is 3 uops. FCMOV is also 3 uops, with 2 cycle latency. A branching implementation (to maybe conditionally run FXCHG before popping st0) could possibly be good if used in a case where it predicted very well. Anyway, the total uop count:

2x FLD: 2 uops for port2 or port3
FUCOMI: 3 uops for p0/p1
jcc: 1 uop for p0/p6 (assuming predicted-not-taken)
FCMOV: 3 uops (2p0 1p5)
FSTP reg: 1 uop for p0/p1
ret: 1 uop for p6 (micro-fused with a load for p237. That's funny, I thought p7 was only for simple store-addresses. Maybe a typo in the table)

total fused-domain uops: 10 (not counting the ret). So it takes 2.5 cycles for this to issue (in groups of 4). There might be bottlenecks on a specific execution port, but I didn't check for that.

Turns out gcc agrees with my implementation choice :):

see the code on the Godbolt compiler explorer, compiled with gcc6.2 -m32 -mfpmath=387 -O3 -march=haswell

double dmax(double a, double b) { return a<b ? b : a; }

    fld     QWORD PTR [esp+4]
    fld     QWORD PTR [esp+12]    ;; it doesn't matter which order you load args in, IDK why I chose reverse order
    fucomi  st, st(1)
    fcmovbe st, st(1)             ;; moving when they're equal matches the C, but of course doesn't matter
    fstp    st(1)
    ret