I noticed a rounding error when I tried to add the two following floating point numbers on an Intel core I7 / I5 :
2.500244140625E+00 + 4503599627370496.00 <=> 0x1.4008p+1 + 0x1.0p+52
The addition, made with two double
precision constants by the faddl
assembly instruction (when I compile with a 32 bits compiler).
The result I obtains is :
4.503599627370498E+15 = 0x1.0000000000002p+52
Instead of :
4.503599627370499E+15 = 0x1.0000000000003p+52
(as I was expected and was confirmed by http://weitz.de/ieee/.)
Demonstration:
0x1.0p+52 = 0x10000000000000.00p+0
0x1.4008p+1 = 0x2.801p+0
0x10000000000000.00p+0 + 0x2.801p+0 = 0x10000000000002.801p+0 (exactly)
0x10000000000002.801p+0 = 0x1.0000000000002801p+52 (exactly)
0x10000000000002.801p+0 = 0x1.0000000000003p+52 (after rounding)
I double check and verify in debugging mode that my FPU is in "round to the nearest mode".
Something witch is even more strange is that when I compile my code with a 64 bits compiler, and then the addsd
instruction is used, there is no rounding error.
Does anyone can give me reference or explanation about precision differences on 'double' addition on the same FPU but using different instruction set ?
The FPU registers are 80-bit wide, whenever a single or double precision number is loaded with fld
and variants it is converted into the double extended precision by default1.
Thus fadd
usually works with 80-bit numbers.
The SSE registers are format agnostic and the SSE extensions don't support the double extended precision.
For example, addpd
works with double precision numbers.
The default rounding mode is round to nearest (even) that means the usual round to nearest but toward the even end in case of a tie (e.g. 4.5 => 4).
To implement the IEEE 754 requirement to perform arithmetic as with an infinite precision numbers, the hardware need two guards bit and a sticky bit2
I'll write a double precision number as
<sign> <unbiased exponent in decimal> <implicit integer part> <52-bit mantissa> | <guard bits> <sticky bit>
The two numbers
2.500244140625
4503599627370496
are
+ 1 1 0100000000 0010000000 0000000000 0000000000 0000000000 00
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00
The first one is shifted
+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 00 |00 0
The sum is done
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10 |10 1
Rounding to nearest (even) gives
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 11
because 0 |10 1
is closer to 1 |00 0
than 0 |00 0
.
The two numbers are
+ 1 1 0100000000 0010000000 0000000000 0000000000 0000000000 0000000000 000
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000
The first is shifted
+ 52 0 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 000 | 00 0
The sum is done
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000 | 10 0
Rounding to nearest (even):
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000
as 0 | 10 0
is tie broken to the nearest even.
When this number is then converted from double extended precision to double precision (due to a fstp QWORD []
) the rounding is repeated using bit 52, 53 and 54 of the double extended mantissa as guards and sticky bits
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 1010000000 000
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10|100
+ 52 1 0000000000 0000000000 0000000000 0000000000 0000000000 10
because 0|100
is again tie broken to the nearest even.
1 See Chapter 8.5.1.2 of the Intel Manual - Volume 1.
2 The guard bit are extra precision bits retained after one of the number is shifted to make the exponents match. The sticky bit it the OR of bits less significant than the the least guard. See the "on Rounding" section of this page and Goldberg for a format approach.