Search code examples
cfloating-pointroundingc99ieee-754

IEEE floating-point rounding in C


I am having trouble to understand a specific IEEE double computation. Take the following C99 program, that runs on a host with IEEE double (8 bytes, 11 bits biased exponent, 52 bits encoded mantissa):

#include <stdio.h>
#include <fenv.h>

// Expect IEEE double
int _a[sizeof (double) == 8 ? 1 : -1];

#define PP(X) printf (#X " = %d\n", X)

int main (void)
{
  // Print rounding modes.
  PP (FE_TONEAREST);
  PP (FE_UPWARD);
  PP (FE_DOWNWARD);
  PP (FE_TOWARDZERO);

  // What mode do we have?
  printf ("rounding mode = %d\n", fegetround());

  // Add a and b.
  double a = -0x1.638e38e38e38ep5;
  double b = -0x1.638e38e38e38ep6;
  __asm ("" : "+r" (a));
  __asm ("" : "+r" (b));
  printf ("a   = %a\n", a);
  printf ("b   = %a\n", b);
  printf ("a+b = %a\n", a + b);

  return 0;
}

Compile and run:

$ gcc rounding.c -lm && ./a.out
FE_TONEAREST = 0
FE_UPWARD = 2048
FE_DOWNWARD = 1024
FE_TOWARDZERO = 3072
rounding mode = 0
a   = -0x1.638e38e38e38ep+5
b   = -0x1.638e38e38e38ep+6
a+b = -0x1.0aaaaaaaaaaaap+7

My question is: Why the least significant nibble of the sum is a and not rounded to b as round-to-nearest is on?

In a IEEE double emulation with 56 bits for the decoded mantissa, the computation is like this:

# Printed the double values, same like on the host:
A   = -0x1.638e38e38e38ep5
B   = -0x1.638e38e38e38ep6
# Internal format with 56-bit mantissa. The digit after | indicates
# three extra bits compared to IEEE.
# 
A     = -0x1.638e38e38e38e|0, expo = 5
B     = -0x1.638e38e38e38e|0, expo = 6
A + B = -0x1.0aaaaaaaaaaaa|8, expo = 7

So when A + B gets packed as a double, then this would round to

(double) (A + B) = -0x1.0aaaaaaaaaaabp+7

No?


Solution

  • As your extended-precision data shows, the exact result of the addition is 1.0AAAAAAAAAAAA816•27, where bold shows the part that will fit in an IEEE-754 double-precision number (binary64); the 8 digit is beyond what can be represented.

    The two numbers representable in binary64 that immediately surround this number are 1.0AAAAAAAAAAAA16•27 and 1.0AAAAAAAAAAAB16•27. These are equally distant from 1.0AAAAAAAAAAAA816•27. For the round-to-nearest, ties-to-even method, IEEE 754 4.3.1 says “… if the two nearest floating-point numbers bracketing an unrepresentable infinitely precise result are equally near, the one with an even least significant digit shall be delivered…”

    The least significant binary digits of the candidates are 0 and 1 (since the least significant hexadecimal digits are A, 1010, and B, 1011). 0 is even and 1 is odd, so 1.0AAAAAAAAAAAA16•27 is chosen.