Search code examples
croundingieee-754explibm

I do *not* want correct rounding for function exp


The GCC implementation of the C mathematical library on Debian systems has apparently an (IEEE 754-2008)-compliant implementation of the function exp, implying that rounding shall always be correct:

(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply–add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.

It turns out that I am encountering a case where this feature is actually hindering, because the exact result of the exp function is often nearly exactly at the middle between two consecutive double values (1), and then the program carries plenty of several further computations, losing up to a factor 400 (!) in speed: this was actually the explanation to my (ill-asked :-S) Question #43530011.

(1) More precisely, this happens when the argument of exp turns out to be of the form (2 k + 1) × 2-53 with k a rather small integer (like 242 for instance). In particular, the computations involved by pow (1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2-44.

Since implementations of correct rounding can be so much time-consuming in certain circumstances, I guess that the developers will also have devised a way to get a slightly less precise result (say, only up to 0.6 ULP or something like this) in a time which is (roughly) bounded for every value of the argument in a given range… (2)

… But how to do this??

(2) What I mean is that I just do not want that some exceptional values of the argument like (2 k + 1) × 2-53 would be much more time-consuming than most values of the same order of magnitude; but of course I do not mind if some exceptional values of the argument go much faster, or if large arguments (in absolute value) need a larger computation time.

Here is a minimal program showing the phenomenon:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
 {
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
   {
    a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
    c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
   }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %d\n", stop - start);
  return 0;
 }

Now after gcc -std=c99 program53.c -lm -o program53:

$ ./program53
1.000000e+06
Clock time spent: 13470008
$ ./program53 
1.000000e+06
Clock time spent: 13292721
$ ./program53 
1.000000e+06
Clock time spent: 13201616

On the other hand, with program52 and program54 (got by replacing 0x20000000000000 by resp. 0x10000000000000 and 0x40000000000000):

$ ./program52
1.000000e+06
Clock time spent: 83594
$ ./program52
1.000000e+06
Clock time spent: 69095
$ ./program52
1.000000e+06
Clock time spent: 54694
$ ./program54
1.000000e+06
Clock time spent: 86151
$ ./program54
1.000000e+06
Clock time spent: 74209
$ ./program54
1.000000e+06
Clock time spent: 78612

Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those of the Debian systems (including Ubuntu) show this phenomenon.

P.-S.: I hope that my question is not a duplicate: I searched for a similar question thoroughly without success, but maybe I did note use the relevant keywords… :-/


Solution

  • To answer the general question on why the library functions are required to give correctly rounded results:

    Floating-point is hard, and often times counterintuitive. Not every programmer has read what they should have. When libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library function when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.

    In many cases, specific knowledge about floating point algorithms can produce considerable improvements to accuracy and/or performance, like in the testcase:

    Taking the exp() of numbers very close to 0 in floating-point numbers is problematic, since the result is a number close to 1 while all the precision is in the difference to one, so most significant digits are lost. It is more precise (and significantly faster in this testcase) to compute exp(x) - 1 through the C math library function expm1(x). If the exp() itself is really needed, it is still much faster to do expm1(x) + 1.

    A similar concern exists for computing log(1 + x), for which there is the function log1p(x).

    A quick fix that speeds up the provided testcase:

    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>
    #include <time.h>
    
    int main (void)
    {
      int i;
      double a, c;
      c = 0;
      clock_t start = clock ();
      for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
        {
          a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
          c += expm1 (a) + 1; // replace exp() with expm1() + 1
        }
      clock_t stop = clock ();
      printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
      printf ("Clock time spent: %d\n", stop - start);
      return 0;
    }
    

    For this case, the timings on my machine are thus:

    Original code

    1.000000e+06

    Clock time spent: 21543338

    Modified code

    1.000000e+06

    Clock time spent: 55076

    Programmers with advanced knowledge about the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical

    For an experienced programmer it may be possible to write an approximative implementation of a slow function using methods like Newton-Raphson, Taylor or Maclaurin polynomials, specifically inexactly rounded specialty functions from libraries like Intel's MKL, AMD's AMCL, relaxing the floating-point standard compliance of the compiler, reducing precision to ieee754 binary32 (float), or a combination of these.

    Note that a better description of the problem would enable a better answer.