
IEEE-754 double precision and splitting method


When computing elementary functions, constants are modified in particular ways, especially in the implementation of exp(x). In all these implementations, any correction with ln(2) is done in two steps: ln(2) is split into two numbers:

static const double ln2p1   = 0.693145751953125;
static const double ln2p2   = 1.42860682030941723212E-6;
// then ln(2) = ln2p1 + ln2p2

Then any computation with ln(2) is done by:

 blablabla -= ln2p1
 blablabla -= ln2p2
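
To give an idea of where these two subtractions typically appear, here is a minimal, self-contained sketch of the usual two-step (Cody-Waite style) argument reduction for exp(x). The function name exp_via_split is made up for illustration, and std::exp stands in for the polynomial approximation a real implementation would use for the reduced argument:

#include <cmath>
#include <cstdio>

static const double ln2p1 = 0.693145751953125;
static const double ln2p2 = 1.42860682030941723212E-6;

// Illustrative only: reduce x to r = x - k*ln(2), subtracting the two
// halves of ln(2) one after the other, then rebuild exp(x) = exp(r) * 2^k.
double exp_via_split(double x) {
    double k = std::floor(x / std::log(2.0) + 0.5); // nearest integer to x/ln(2)
    double r = x;
    r -= k * ln2p1;   // first step: remove the "high" part of k*ln(2)
    r -= k * ln2p2;   // second step: remove the remaining "low" part
    // A real implementation evaluates exp(r) with a polynomial; std::exp
    // is used here only to keep the sketch short.
    return std::ldexp(std::exp(r), static_cast<int>(k)); // exp(r) * 2^k
}

int main() {
    std::printf("%.17g\n%.17g\n", exp_via_split(10.0), std::exp(10.0));
    return 0;
}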

I know it is to avoid rounding effects. But why these two numbers in particular? Does anyone have an idea how to obtain these two numbers?

Thank you !

Following the first comment, I am completing this post with more material and a very strange question. I worked with my team and we agree that the point is to potentially double the precision by splitting the number ln(2) into two numbers. For this, two transformations are applied:

1) c_h = floor(2^k ln(2))/2^k
2) c_l = ln(2) - c_h

Here k indicates the precision. It looks like in the Cephes library (~1980) k was fixed at 9 for float, 16 for double, and also 16 for long double (why, I do not know). So for double, c_h has a precision of 16 bits while c_l has 52 bits.
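
To make transformations 1) and 2) concrete, here is a small sketch (the helper name split_ln2 is just for illustration) that computes c_h and c_l for a given k. Note that the Cephes constants used below correspond to rounding 2^k * ln(2) to the nearest integer rather than taking its floor (which is why c_l can come out negative), so the sketch uses std::round; ln(2) is taken from std::log(2.0), so c_l only matches the published constants to double precision:

#include <cmath>
#include <cstdio>

// Split ln(2) into c_h = round(2^k * ln(2)) / 2^k and c_l = ln(2) - c_h.
static void split_ln2(int k) {
    const double ln2 = std::log(2.0);                  // ln(2) as a double
    const double c_h = std::round(std::ldexp(ln2, k))  // integer nearest to 2^k * ln(2) ...
                       / std::ldexp(1.0, k);           // ... divided back by 2^k
    const double c_l = ln2 - c_h;                      // the remainder
    std::printf("k = %2d  c_h = %.17g  c_l = %.17g\n", k, c_h, c_l);
}

int main() {
    split_ln2(9);    // c_h = 0.693359375,       c_l ~ -2.1219444e-4
    split_ln2(16);   // c_h = 0.693145751953125, c_l ~  1.4286068203094172e-06
    return 0;
}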

From this, I wrote the following program and determined c_h with 52-bit precision.

 #include <iostream>
 #include <iomanip>
 #include <math.h>
 #include <stdint.h>   // int64_t
 #include <stdlib.h>   // atof

 enum precision { nine = 9, sixteen = 16, fiftytwo = 52 };

 // k = floor(x / ln(2))
 int64_t k_helper(double x){
     return floor(x/log(2));
 }

 // subtract k*ln(2) from x in two steps, using the split c_h + c_l
 template<class C>
 double z_helper(double x, int64_t k){
     x -= k*C::c_h;
     x -= k*C::c_l;
     return x;
 }

 template<precision p>
 struct coeff{};

 // Cephes single-precision split (k = 9)
 template<>
 struct coeff<nine>{
     static constexpr double c_h = 0.693359375;
     static constexpr double c_l = -2.12194440e-4;
 };

 // Cephes double-precision split (k = 16)
 template<>
 struct coeff<sixteen>{
     static constexpr double c_h = 6.93145751953125E-1;
     static constexpr double c_l = 1.42860682030941723212E-6;
 };

 // split keeping 52 bits in c_h
 template<>
 struct coeff<fiftytwo>{
     static constexpr double c_h = 0.6931471805599453972490664455108344554901123046875;
     static constexpr double c_l = -8.78318343240526578874146121703272447458793199905066E-17;
 };


 int main(int argc, const char * argv[]) {

    if (argc < 2) return 1;   // expects the argument x on the command line

    double x = atof(argv[1]);
    int64_t k = k_helper(x);

    double z_9  = z_helper<coeff<nine> >(x,k);
    double z_16 = z_helper<coeff<sixteen> >(x,k);
    double z_52 = z_helper<coeff<fiftytwo> >(x,k);

    std::cout << std::setprecision(16) << " 9  bits precisions " << z_9 << "\n"
                                       << " 16 bits precisions " << z_16 << "\n"
                                       << " 52 bits precisions " << z_52 << "\n";

    return 0;
 }

If I now compute this for a set of different values, I get:

bash-3.2$ g++ -std=c++11 main.cpp  
bash-3.2$ ./a.out 1
9  bits precisions 0.30685281944
16 bits precisions 0.3068528194400547
52 bits precisions 0.3068528194400547
bash-3.2$ ./a.out 2
9  bits precisions 0.61370563888
16 bits precisions 0.6137056388801094
52 bits precisions 0.6137056388801094
bash-3.2$ ./a.out 100
9  bits precisions 0.18680599936
16 bits precisions 0.1868059993678755
52 bits precisions 0.1868059993678755
bash-3.2$ ./a.out 200
9  bits precisions 0.37361199872
16 bits precisions 0.3736119987357509
52 bits precisions 0.3736119987357509
bash-3.2$ ./a.out 300
9  bits precisions 0.56041799808
16 bits precisions 0.5604179981036264
52 bits precisions 0.5604179981036548
bash-3.2$ ./a.out 400
9  bits precisions 0.05407681688
16 bits precisions 0.05407681691155647
52 bits precisions 0.05407681691155469
bash-3.2$ ./a.out 500
9  bits precisions 0.24088281624
16 bits precisions 0.2408828162794319
52 bits precisions 0.2408828162794586
bash-3.2$ ./a.out 600
9  bits precisions 0.4276888156
16 bits precisions 0.4276888156473074
52 bits precisions 0.4276888156473056
bash-3.2$ ./a.out 700
9  bits precisions 0.61449481496
16 bits precisions 0.6144948150151828
52 bits precisions 0.6144948150151526

It looks like a difference appears when x becomes larger than 300. I had a look at the implementation in glibc:

http://osxr.org:8080/glibc/source/sysdeps/ieee754/ldbl-128/s_expm1l.c

Presently it is using the 16-bit precision for c_h (line 84).

Well, I am probably missing something about the IEEE standard, and I cannot imagine a precision error in glibc. What do you think?

Best,


Solution

  • ln2p1 is exactly 45426/65536. This can be obtained by round(65536 * ln(2)). ln2p2 is simply the remainder. So what is special about these two numbers is the denominator 65536 (2^16); see the quick check after this answer.

    From what I found, most algorithms using this constant can be traced back to the Cephes library, which was first released in 1984, when 16-bit computing was still dominant; that probably explains why 2^16 was chosen.
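
    As a quick check of the numbers above, here is a small sketch; the remainder is computed in plain double, so it matches the long published constant only to double precision:

#include <cmath>
#include <cstdio>

int main() {
    const double ln2p1 = 0.693145751953125;
    // The high part is exactly 45426 / 65536 == round(65536 * ln(2)).
    std::printf("round(65536*ln(2)) = %.0f\n", std::round(65536.0 * std::log(2.0))); // 45426
    std::printf("45426.0/65536.0    = %.17g\n", 45426.0 / 65536.0);  // 0.693145751953125
    std::printf("ln2p1 in hex       = %a\n", ln2p1);  // 0x1.62e4p-1: an exact multiple of 2^-16
    // The low part is simply the remainder ln(2) - ln2p1.
    std::printf("remainder          = %.17g\n", std::log(2.0) - ln2p1);  // ~1.4286068203094172e-06
    return 0;
}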