Search code examples
cgccgnuc11epsilon

Machine epsilon calculation is different using C11 and GNU11 compiler flags


When using Python & Julia, I can use a neat trick to investigate machine epsilon for a particular floating point representation.

For example, in Julia 1.1.1:

julia> 7.0/3 - 4/3 - 1
2.220446049250313e-16 

julia> 7.0f0/3f0 - 4f0/3f0 - 1f0
-1.1920929f-7

I'm currently learning C and wrote this program to try and achieve the same thing:

#include <stdio.h>

int main(void)
{
  float foo;
  double bar;

  foo = 7.0f/3.0f - 4.0f/3.0f - 1.0f;
  bar = 7.0/3.0 - 4.0/3.0 - 1.0;

  printf("\nM.E. for float: %e \n\n", foo);
  printf("M.E. for double: %e \n\n", bar);

  return 0;
}

Curiously, the answer I get depends on whether I use C11 or GNU11 compiler standard. My compiler is GCC 5.3.0, running on Windows 7 and installed via MinGW.

So in short, when I compile with: gcc -std=gnu11 -pedantic begin.c I get:

M.E. for float: -1.192093e-007

M.E. for double: 2.220446e-016

as I expect, and matches Python and Julia. But when I compile with: gcc -std=c11 -pedantic begin.c I get:

M.E. for float: -1.084202e-019

M.E. for double: -1.084202e-019

which is unexpected. I thought it might by GNU specific features which is why I added the -pedantic flag. I have been searching on google and found this: https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html but I still am unable to explain the difference in behaviour.

To be explicit, my question is: Why is the result different using the different standards?

Update: The same differences apply with C99 and GNU99 standards.


Solution

  • In C, the best way to get the float or double epsilon is to include <float.h> and use FLT_MIN or DBL_MIN.

    The value of 7.0/3.0 - 4.0/3.0 - 1.0; is not fully specified by the C standard because it allows implementations to evaluate floating-point expressions with more precision than the nominal type. To some extent, this can be dealt with by using casts or assignments. The C standard requires casts or assignments to “discard” excess precision. This is not a proper solution in general, because there can be rounding both with the initial excess precision and with the operation that “discards” excess precision. This double-rounding may produce a different result than calculating entirely with the nominal precision.

    Using the cast workaround with the code in the question yields:

    _Static_assert(FLT_RADIX == 2, "Floating-point radix must be two.");
    float FloatEpsilon = (float) ((float) (7.f/3) - (float) (4.f/3)) - 1;
    double DoubleEpsilon = (double) ((double) (7./3) - (double) (4./3)) - 1;
    

    Note that a static assertion is required to ensure that the floating-point radix is as expected for this kludge to operate. The code should also include documentation explaining this bad idea:

    • The binary representation for the fraction ⅓ ends in an infinite sequences of “01010101…”.
    • When the binary for 4/3 or 7/3 is rounded to a fixed precision, it is as if the numeral were truncated and rounded down or up, depending on whether the next binary digit after truncation were a 0 or a 1.
    • Given our assumption that floating-point uses a base-two radix, 4/3 and 7/3 are in consecutive binades (4/3 is in [1, 2), and 7/3 is in [2, 4). Therefore, their truncation points are one position apart.
    • Thus, we converting to a binary floating-point format, 4/3 and 7/3 differ in that the latter exceeds the former by 1 and its significand ends one bit sooner. Examination of the possible truncation points reveals that, aside from the initial difference of 1, the significands differ by the value of the position of the low bit in 4/3, although the difference may be in either direction.
    • By Sterbenz’ Lemma, there is no floating-point error in subtracting 4/3 from 7/3, so the result is exactly 1 plus the difference described above.
    • Subtracting 1 produces that difference, which is the value of the position of the low bit of 4/3 except that it may be positive or negative.