Search code examples
assemblyfloating-pointx86-64nasm

Round floating-point numbers in NASM


I am trying to build a NASM library for a C program. I would like to round a floating-point number given as parameter.

The C function prototype would look like this:

double nearbyint(double x);

I tried using the frndint instruction but I need to push my parameter to the stack first.

Here is what I came up with (does not compile):

bits 64

section .text

global nearbyint

nearbyint:
    push    rbp
    mov     rbp, rsp

    fld     xmm0
    frndint
    fstp    xmm0

    leave
    ret

Solution

  • The only way to get data between x87 and XMM is by bouncing it through memory. e.g. movsd [rsp-8] / fld qword [rsp-8] to use the red-zone.

    But you don't need to use x87 at all, and shouldn't if you want it to be efficient.

    If you have SSE4.1, Use roundsd to round to integer.

    • rint: roundsd xmm0,xmm0, 0b0100 - current rounding mode (bit 2 = 1), inexact exception flag is set if the result != input (bit 3 = 0).
    • nearbyint: roundsd xmm0,xmm0, 0b1100 current rounding mode (bit 2 = 1), inexact exception suppressed (bit 3 = 1).
    • roundsd xmm0,xmm0, 0b1000 : rounding mode override (bit 2 = 0) to _MM_FROUND_TO_NEAREST_INT (bits [1:0] = 00). See roundpd docs for a table. Inexact exception suppressed (bit 3 = 1).

    Without SSE4.1 for roundsd, have a look at what glibc's rint does: it adds 2^52 (bit pattern 0x43300000, 0x00000000), making a number so large that the nearest representable doubles are whole integers. So normal FP rounding to the nearest representable value rounds to integer. IEEE binary64 double has 52 explicit mantissa (aka significand) bits, so the size of this number is not a coincidence.

    (For negative inputs, it uses -2^52)

    Subtracting that again gives you a rounded version of your original number.

    The glibc implementation checks for some special cases (like Inf and NaN), and for exponents less than 0 (i.e. inputs with magnitude smaller than 1.0) it copies in the sign bit of the input. So -0.499 through -0.0 will round to -0.0 instead of 0. Otherwise all small inputs would round to +0, since IEEE sub between equal values is required to produce +0.0. Or -0.0 if the current rounding mode is towards -Inf; a sign fixup makes +0.499 round to +0.0 even if the current rounding mode is TowardNegative.

    A simple way to implement that with SSE2 would be to isolate the sign bit of the input with pand xmm0, [signbit_mask], then OR in the FP bit pattern of 0x43300000 ..., giving you +- 2^52.

    default rel
    
    ;; UNTESTED.  IDK if the SIGNBIT_FIXUP does anything other than +-0.0
    rint_sse2:
        ;movsd  xmm1, [signbit_mask]  ; can be cheaply constructed on the fly, unlike 2^52
        ;pand   xmm1, xmm0
    
        pcmpeqd  xmm1, xmm1
        psrlq    xmm1, 1             ; 0x7FFF...
    %ifdef SIGNBIT_FIXUP
        movaps   xmm2, xmm1          ; copy the mask
    %endif
    
        andnps   xmm1, xmm0          ; isolate sign bit
    
    %ifdef SIGNBIT_FIXUP
        movaps   xmm3, xmm1          ; save another copy of the sign bit
    %endif
    
        orps     xmm1, [big_double]  ; +-2^52
        addsd    xmm0, xmm1
        subsd    xmm0, xmm1
    
    %ifdef SIGNBIT_FIXUP
        andps    xmm0, xmm2          ; clear the sign bit
        orps     xmm0, xmm3          ; apply the original sign
    %endif
        ret
    
    section .rodata
    align 16
       big_double: dq 0x4330000000000000   ; 2^52
       ; a 16-byte load will grab whatever happens to be next
       ; if you want a packed rint(__m128d), use   times 2 dq ....
    

    Especially if you omit the SIGNBIT_FIXUP stuff, this is pretty cheap, not much more expensive than roundsd's 2 uops in terms of FP latency. (roundsd has the same latency as addsd + subsd on most CPUs. This is almost certainly not a coincidence, but it does avoid any separate uops for sorting out the sign).