Search code examples
floating-pointprecisionexponentiationmathematical-expressions

What gives best precision, exponential of the difference or the quotient of the exponentials?


Given a typical programming language.

I have two numbers floating point numbers a and b which are close to one another (i.e. their difference is much smaller, in absolute value, than their average).

|a-b| << |a+b|/2

Mathematically, we have that

exp(a-b) = exp(a)/exp(b) .

But when you programme, you can choose to first calculate (a-b) and then exponentiate it, or exponentiate a, then b, then divide them.

If a and b are really close to one another, then the precision on (a-b) might be bad.

Example

(1+pi*10^-20) -(1+1.1*pi*10^-20) = -pi*10^-21

but if you are using floating points with only 19 decimal points of precision. You will get zero as an answer, which is a bad precision. You could have gotten a better precision by reordering the operation as follows

(1-1) + (pi*10^-20 -1.1*pi*10^-20) = -pi*10^-21

which will give you -pi*10^-21 with 19 decimal points of precision.

Therefore my question is, given a finite floating point precision, which way of calculating exp(a-b) gives the better precision?

The exponential of the difference:

exp(a-b)

or the quotient of the exponentials

exp(a)/exp(b)

?


Solution

  • If a and b are really close to one another, then the precision on (a-b) might be bad.

    On the contrary, if a and b are close, then a - b is exact (Sterbenz lemma). Therefore exp(a-b) involves only one rounding step (assume a correctly rounded exp function to simplify reasoning). As a consequence, exp(a-b) gives the correctly rounded result for the expression you are trying to compute. Any other way will never be better than this.

    Circumstances in which exp(a-b) is dramatically better than the alternative are when exp(a) and exp(b) individually overflow, resulting in NaN for the division. By contrast, exp(a-b) only produces +inf when that's the best floating-point approximation of the mathematical result, and never produces NaN for finite a and b.


    Note: the problem that you may be facing is that a and b have been computed so approximately that the relative accuracy of a - b is terrible. This is an inherent problem of floating-point (cancellation), but at the time of computing exp(a-b) in any way, it is too late to try to solve this problem: the information has already been lost. You can only compute exp(a-b) for the a and the b you have. And the correct way to do this is with exp(a-b).


    For simple expressions for which one expects extra precision bits to provide extra accuracy for the end result, it is also possible to test these things empirically by computing a reference result with extra precision. If I had approached the question that way, I might have written the C program (for my platform where FLT_EVAL_METHOD is defined by the compiler as 0 and where long double is IEEE 754 80-bit double-extended):

    #include <math.h>
    #include <stdlib.h>
    #include <stdio.h>
    
    double best(double a, double b) {
      return exp(a-b);
    }
    
    double other(double a, double b) {
      return exp(a) / exp(b);
    }
    
    long double reference(long double a, long double b) {
      // assume the method doesn't matter so much with
      // long double computations, use any method:
      return expl(a-b);
    }
    
    int main(void) {
      for (int i = 0; i < 10; i++) {
        double a = rand() ^ ((long long)rand())<<16 ^ ((long long)rand()) << 32;
        a /= 0x1.0p64;
        double b = a * (1 + (double)rand() / (5.0 * RAND_MAX));
        printf("a=%a\nb=%a\n", a, b);
    
        double be = best(a, b);
        double o = other(a, b);
    
        long double r = reference(a, b);
    
        printf("error sub then exp:%La\n", fabsl(r - be));
        printf("error exp then div:%La\n", fabsl(r - o));
      }
    }
    

    A run of the above program produces the results:

    ~ $ gcc ex.c && ./a.out
    a=0x1.82def03cebc5p-2
    b=0x1.a65bc869d479cp-2
    error sub then exp:0xa.dp-60
    error exp then div:0xa.dp-60
    a=0x1.8164b7a7be6dep-6
    b=0x1.b5b82c63fa0bcp-6
    error sub then exp:0x8.1p-58
    error exp then div:0x8.1p-58
    a=0x1.88b779e295f18p-3
    b=0x1.b1836e39a5186p-3
    error sub then exp:0x8.5p-59
    error exp then div:0xd.ecp-57
    a=0x1.b5f437ec6fc4ap-6
    b=0x1.e459d0a1d64b9p-6
    error sub then exp:0xa.1p-59
    error exp then div:0xd.7cp-57
    a=0x1.889e1b20a7c9dp-3
    b=0x1.8dddc5217d12fp-3
    error sub then exp:0x9.4p-61
    error exp then div:0xf.6cp-57
    a=0x1.2d8f07147984cp-2
    b=0x1.65acc944015e6p-2
    error sub then exp:0x8.ep-58
    error exp then div:0x8.ep-58
    a=0x1.78b8432157465p-5
    b=0x1.a9fd15e131562p-5
    error sub then exp:0x9.4p-61
    error exp then div:0xf.6cp-57
    a=0x1.d214f566a0e1ep-2
    b=0x1.0c90cb63e50c4p-1
    error sub then exp:0xd.54p-58
    error exp then div:0xd.54p-58
    a=0x1.78dfa1c5c2ac4p-2
    b=0x1.919d3723ae61p-2
    error sub then exp:0x9.e8p-59
    error exp then div:0x9.e8p-59
    a=0x1.fb68c3c57fdd3p-2
    b=0x1.103e0374df0bbp-1
    error sub then exp:0xd.54p-58
    error exp then div:0x9.56p-57