Search code examples
c++numericalapproximation

Approximation error when using sqrt and floor


I have to do an enumeration of solutions of an equation and I know that y < x *( sqrt(n) - 1 ), where x, y and n are integers.

My naive approach would be to look for y less or equal than floor( x * ( sqrt( (float)n ) - 1 ) ).


  • Should I be worried about approximation error?

  • For example, if my expression is a little be greater than an integer m, should I be worried to get m-1 at the end?

  • How could I detect such errors?


Solution

  • You should definitely be worried about approximation error, but how worried depends upon the ranges of values for x and n you are concerned about.

    Calculations in IEEE 4-byte floating point representations are going to have errors roughly on the order of one part in 2^23 to 2^24; for an 8-byte representation (i.e, a double), it will be roughly one part in 2^52 to 2^53. You could expect then that you would need to use doubles rather than floats to get an accurate result for 32-bit integers x and n, and that even a double would be insufficient for 64-bit integers.

    As an example, consider the code:

    template <typename F,typename V>
    F approxub(V x,V n) {
        return std::floor(x*std::sqrt(F(n))-x);
    }
    
    uint64_t n=1000000002000000000ull; // (10^9 + 1)^2 - 1
    uint64_t x=3;
    uint64_t y=approxub<double>(x,n);
    

    This gives a value of y=3000000000, but the correct value is 2999999999.

    It's even worse when x is large and n is small: large 64-bit integers are not exactly representable in IEEE doubles:

    uint64_t n=9;
    uint64_t x=5000000000000001111; // 5e18 + 1111
    uint64_t y=approxlb<double>(x,n);
    

    The correct value for y (putting to one side the issue of when n is a perfect square — the true upper bound will be one less in this case) is 2 x = 10000000000000002222, i.e. 1e19 + 2222. The computed y is, however, 10000000000000004096.

    Avoiding floating point approximations

    Suppose you had a function isqrt which exactly computed the integer part of the square-root of an integer. Then you could say

    y = isqrt(x*x*n) - x
    

    and provided that the product x*x*n fit inside your integer type, you would have an exact upper bound (or one more than the upper bound if n is a perfect square.) There's more than one way to write an isqrt function; this is an example implementation based on the material at code codex:

    template <typename V>
    V isqrt(V v) {
        if (v<0) return 0;
    
        typedef typename std::make_unsigned<V>::type U;
        U u=v,r=0;
    
        constexpr int ubits=std::numeric_limits<U>::digits;
        U place=U(1)<<(2*((ubits-1)/2));
    
        while (place>u) place/=4;
        while (place) {
            if (u>=r+place) {
                u-=r+place;
                r+=2*place;
            }
            r/=2;
            place/=4;
        }
        return (V)r;
    }
    

    What if x is too large for this? For example, if our largest integral type has 64 bits, and x is larger than 2^32. The most straightforward solution would be to do a binary search, taking as bounds x r - x and x r, where r = [√n] is the integer square root.