Search code examples
opengl-esfloating-pointglsl

How to emulate double precision using two floats in OpenGL ES?


I am working on creating deep zooms into the Mandelbrot set, and as you might know OpenGL ES does not support the double data type. The highest precision that it can offer is that of a IEEE 754 float. On Googling, and after a lot of searching, I came across this blog: https://blog.cyclemap.link/2011-06-09-glsl-part2-emu/ that is totally dedicated to this topic. But, unfortunately I cannot understand the code for addition, subtraction and multiplication, presented there. Especially, I can't understand the part that deals with the error correction and carrying. It would be extremely helpful, if you could explain to me, the depth of the error checking and carrying over of bits from low parts to higher. So, far I only understand the fundamental concept of splitting the double into two floats. But, the implementation of the basic operations is unclear to me. It would be very helpful if the explanation is done using the context of binary numbers.


Solution

  • First the basic principle that is used to deal with this. Once you add or substract numbers with high exponent difference the result gets rounded:

    12345600000 + 8.76542995683848E-4 = 12345600000
    

    Now as I showed you in here we can store our numbers as sum of more floats for example vec2, vec3, vec4 which are still floats but together can combine to bigger overall mantissa bitwidth. The link in your question is not using exponent ranges like I did but uses difference between rounded and not rounded results. However the linked lib uses just vec2 which is still much less precise than native 64 bit double as mantissa of double has 53 bits and float has just 24 bits so 24+24 = 48 < 53 that is why I decided to use vec3. Now the trick is to obtain the rounding error. For the same example as above:

    a=12345600000 
    b=8.76542995683848E-4
    c=a+b=12345600000
    

    the a,b are float operands for + operation and c is rounded result. so the difference e can be obtained like this:

    e=c-a; // e= 0
    e-=b;  // e=-8.76542995683848E-4
    e=-e;  // e=+8.76542995683848E-4
    

    where e is the stuff that should be added to c in order to get not rounded result.

    Now if we store some part of number in each component of vec3 then we can try to add this e to all of them (always removing added part from e) until e is zero.

    So if c.x+e does round we add it to c.y and so on... Based on this I managed to compose this:

    //---------------------------------------------------------------------------
    //--- High Precision float ver: 1.000 ---------------------------------------
    //---------------------------------------------------------------------------
    #ifndef _GLSL_HP32
    #define _GLSL_HP32
    //---------------------------------------------------------------------------
    // helper functions (internals)
    void hp32_nor(vec3 &c)          // bubble sort c coordinates desc by magnitude
        {
        float x;
        if (fabs(c.x)<fabs(c.y)){ x=c.x; c.x=c.y; c.y=x; }
        if (fabs(c.x)<fabs(c.z)){ x=c.x; c.x=c.z; c.z=x; }
        if (fabs(c.y)<fabs(c.z)){ x=c.y; c.y=c.z; c.z=x; }
        }
    void hp32_err(vec3 &c,vec3 &e)  // c+=e; apply rounding error e corection to c
        {
        float q;
        q=c.x; c.x+=e.x; e.x=e.x-(c.x-q);
        q=c.x; c.x+=e.y; e.y=e.y-(c.x-q);
        q=c.x; c.x+=e.z; e.z=e.z-(c.x-q);
        q=c.y; c.y+=e.x; e.x=e.x-(c.y-q);
        q=c.y; c.y+=e.y; e.y=e.y-(c.y-q);
        q=c.y; c.y+=e.z; e.z=e.z-(c.y-q);
        q=c.z; c.z+=e.x; e.x=e.x-(c.z-q);
        q=c.z; c.z+=e.y; e.y=e.y-(c.z-q);
        q=c.z; c.z+=e.z; e.z=e.z-(c.z-q);
        hp32_nor(c);
        }
    void hp32_split(vec3 &h,vec3 &l,vec3 a) // (h+l)=a; split mantissas to half
        {
        const float n=8193.0; // 10000000000001 bin uses ~half of mantissa bits
        h=a*n;  // this shifts the a left by half of mantissa (almost no rounding yet)
        l=h-a;  // this will round half of mantissa as h,a have half of mantisa bits exponent difference
        h-=l;   // this will get rid of the `n*` part from number leaving just high half of mantisa from original a
        l=a-h;  // this is just the differenc ebetween original a and h ... so lower half of mantisa beware might change sign
        }
    //---------------------------------------------------------------------------
    // double api (comment it out if double not present)
    vec3 hp32_set(double a)                                             // double -> vec2
        {
        vec3 c;
        c.x=a; a-=c.x;
        c.y=a; a-=c.y;
        c.z=a; hp32_nor(c);
        return c;
        }
    double hp32_getl(vec3 a){ double c; c=a.z+a.y; c+=a.x; return c; }  // vec2 -> double
    //---------------------------------------------------------------------------
    // normal api
    vec3 hp32_set(float a){ return vec3(a,0.0,0.0); }                   // float -> vec2
    float hp32_get(vec3 a){ float c; c=a.z+a.y; c+=a.x; return c; }     // vec2 -> float
    vec3 hp32_add(vec3 a,vec3 b)                                        // = a+b
        {
        // c=a+b; addition
        vec3 c=a+b,e; float q;
        // e=(a+b)-c; rounding error
        c.x=a.x+b.x; e.x=c.x-a.x; e.x-=b.x;
        c.y=a.y+b.y; e.y=c.y-a.y; e.y-=b.y;
        c.z=a.z+b.z; e.z=c.z-a.z; e.z-=b.z;
        e=-e; hp32_err(c,e);
        return c;
        }
    vec3 hp32_sub(vec3 a,vec3 b)                                        // = a-b
        {
        // c=a-b; substraction
        vec3 c=a-b,e; float q;
        // e=(a-b)-c; rounding error
        c.x=a.x+b.x; e.x=c.x-a.x; e.x+=b.x;
        c.y=a.y+b.y; e.y=c.y-a.y; e.y+=b.y;
        c.z=a.z+b.z; e.z=c.z-a.z; e.z+=b.z;
        e=-e; hp32_err(c,e);
        return c;
        }
    vec3 hp32_mul_half(vec3 a,vec3 b)                                   // = a*b where a,b are just half of mantissas !!! internal call do not use this !!!
        {
        //  c = (a.x+a.y+a.z)*(b.x+b.y+b.z)     // multiplication of 2 expresions
        //  c = (a.x*b.x)+(a.x*b.y)+(a.x*b.z)   // expanded
        //     +(a.y*b.x)+(a.y*b.y)+(a.y*b.z)
        //     +(a.z*b.x)+(a.z*b.y)+(a.z*b.z)
        //  c = (a.x*b.x)                       // ordered desc by magnitude (x>=y>=z)
        //     +(a.x*b.y)+(a.y*b.x)
        //     +(a.x*b.z)+(a.z*b.x)+(a.y*b.y)
        //     +(a.y*b.z)+(a.z*b.y)
        //     +(a.z*b.z)
        vec3 c,e,f; float q,r;
        // c=a*b; (e,f)=(a*b)-c; multiplication
        c.x=(a.x*b.x);
        r=(a.x*b.y); q=c.x; c.x+=r; e.x=r-(c.x-q);
        r=(a.y*b.x); q=c.x; c.x+=r; e.y=r-(c.x-q);
        c.y=(a.x*b.z);
        r=(a.z*b.x); q=c.y; c.y+=r; e.z=r-(c.y-q);
        r=(a.y*b.y); q=c.y; c.y+=r; f.x=r-(c.y-q);
        c.z=(a.y*b.z);
        r=(a.z*b.y); q=c.z; c.z+=r; f.y=r-(c.z-q);
        r=(a.z*b.z); q=c.z; c.z+=r; f.z=r-(c.z-q);
        e=+hp32_add(e,f); hp32_err(c,e);
        return c;
        }
    vec3 hp32_mul(vec3 a,vec3 b)                                        // = a*b
        {
        vec3 ah,al,bh,bl,c;
        // split operands to halves of mantissa
        hp32_split(ah,al,a);
        hp32_split(bh,bl,b);
        //  c = (ah+al)*(bh+bl) = ah*bh + ah*bl + al*bh + al*bl
        c=           hp32_mul_half(ah,bh) ;
        c=hp32_add(c,hp32_mul_half(ah,bl));
        c=hp32_add(c,hp32_mul_half(al,bh));
        c=hp32_add(c,hp32_mul_half(al,bl));
        return c;
        }
    //---------------------------------------------------------------------------
    #endif
    //---------------------------------------------------------------------------
    

    For now I tested this only on CPU side (C++). In order to use it in GLSL Just comment out or remove the double api functions Which I used to verify accuracy. And change the fabs to abs or declare:

    float fabs(float x){ return abs(x); }
    

    Again I have some normalization function hp32_nor which sorts the components by magnitude so fabs(x)>=fabs(y)>=fabs(z) which is needed to get back to float and multiplication. The +,- does not need it.

    The hp32_err is like addition between normal number and rounding error difference (its a bit horrible but it preserves accuracy as much as it can) described above.

    I did not test this extensively yet!!! Looks like +,-,* operations are precise in comparison with double.

    The multiplication implementation is a bit complex because a*b on floats has the result bitwidth of mantissa as sum of operands bitwidths. So to avoid rounding we need to first split the operands into halves. that can be done like this (analysed from the lib you linked):

    // split float into h,l
    float a,h,l,n;
    n=8193.0;   // n =               10000000000001.00000000000000000000b
    a=123.4567; // a =                      1111011.01110100111010101000b
    h=a*n;      // h =         11110110111100011000.11000000000000000000b
    l=h-a;      // l =         11110110111010011101.01010000000000000000b
    h-=l;       // h =                      1111011.01110000000000000000b
    l=a-h;      // l =                            0.00000100111010101000b
    

    so float has 24 bits of mantisa and 8193 is (24/2)+1=13 bits wide. So once you multiply any float with it the result needs about half of mantissa bits more than present and get round off. The its just a matter of getting back to original operand scale and getting the other half as difference between new half and original float value. All of this is done in helper function hp32_split.

    Now the multiplication c=a*b on halves look like this:

    c = (ah+al)*(bh+bl) = ah*bh + ah*bl + al*bh + al*bl
    

    And each half multiplication a?*b? loos like this:

    c = (a.x+a.y+a.z)*(b.x+b.y+b.z)     // multiplication of 2 expresions
    c = (a.x*b.x)+(a.x*b.y)+(a.x*b.z)   // expanded
       +(a.y*b.x)+(a.y*b.y)+(a.y*b.z)
       +(a.z*b.x)+(a.z*b.y)+(a.z*b.z)
    c = (a.x*b.x)                       // ordered desc by magnitude (x>=y>=z)
       +(a.x*b.y)+(a.y*b.x)
       +(a.x*b.z)+(a.z*b.x)+(a.y*b.y)
       +(a.y*b.z)+(a.z*b.y)
       +(a.z*b.z)
    

    so I divide it to 3 bracket terms per each component of c. Its important to sort the terms by magnitude to prevent rounding errors as much as possible. Along the summing of terms I just accumulate the error similarly like in addition.