Search code examples
cassemblyfloating-pointfloating-accuracyfpu

Binary64 floating point addition rounding mode error and behaviors difference 32/64 bits


I noticed a rounding error when I tried to add the two following floating point numbers on an Intel core I7 / I5 :

2.500244140625E+00 + 4503599627370496.00 <=> 0x1.4008p+1 + 0x1.0p+52

The addition, made with two double precision constants by the faddl assembly instruction (when I compile with a 32 bits compiler).

The result I obtains is :

4.503599627370498E+15 = 0x1.0000000000002p+52

Instead of :

4.503599627370499E+15 = 0x1.0000000000003p+52

(as I was expected and was confirmed by http://weitz.de/ieee/.)

Demonstration:

0x1.0p+52 = 0x10000000000000.00p+0

0x1.4008p+1 = 0x2.801p+0

0x10000000000000.00p+0 + 0x2.801p+0 = 0x10000000000002.801p+0 (exactly)

0x10000000000002.801p+0 = 0x1.0000000000002801p+52 (exactly)

0x10000000000002.801p+0 = 0x1.0000000000003p+52 (after rounding)

I double check and verify in debugging mode that my FPU is in "round to the nearest mode".

Something witch is even more strange is that when I compile my code with a 64 bits compiler, and then the addsd instruction is used, there is no rounding error.

Does anyone can give me reference or explanation about precision differences on 'double' addition on the same FPU but using different instruction set ?


Solution

  • The FPU registers are 80-bit wide, whenever a single or double precision number is loaded with fld and variants it is converted into the double extended precision by default1.
    Thus fadd usually works with 80-bit numbers.

    The SSE registers are format agnostic and the SSE extensions don't support the double extended precision.
    For example, addpd works with double precision numbers.


    The default rounding mode is round to nearest (even) that means the usual round to nearest but toward the even end in case of a tie (e.g. 4.5 => 4).

    To implement the IEEE 754 requirement to perform arithmetic as with an infinite precision numbers, the hardware need two guards bit and a sticky bit2


    double

    I'll write a double precision number as

    <sign> <unbiased exponent in decimal> <implicit integer part> <52-bit mantissa> | <guard bits> <sticky bit>
    

    The two numbers

    2.500244140625
    4503599627370496
    

    are

    +  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 00
    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00
    

    The first one is shifted

    + 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1   
    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00 |00 0
    

    The sum is done

    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1
    

    Rounding to nearest (even) gives

    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 11
    

    because 0 |10 1 is closer to 1 |00 0 than 0 |00 0.

    double extended

    The two numbers are

    +  1 1 0100000000 0010000000 0000000000 0000000000 0000000000 0000000000 000
    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000
    

    The first is shifted

    + 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0
    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000 | 00 0
    

    The sum is done

    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0
    

    Rounding to nearest (even):

    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000
    

    as 0 | 10 0 is tie broken to the nearest even.

    When this number is then converted from double extended precision to double precision (due to a fstp QWORD []) the rounding is repeated using bit 52, 53 and 54 of the double extended mantissa as guards and sticky bits

    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000
    
    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10|100
    
    + 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10
    

    because 0|100 is again tie broken to the nearest even.


    1 See Chapter 8.5.1.2 of the Intel Manual - Volume 1.
    2 The guard bit are extra precision bits retained after one of the number is shifted to make the exponents match. The sticky bit it the OR of bits less significant than the the least guard. See the "on Rounding" section of this page and Goldberg for a format approach.