Search code examples
c++x86floating-pointieee-754

Fast way to get a close power-of-2 number (floating-point)


In numerical computation, it is often needed to scale numbers to be in safe range.

For example, computing Euclidean distance: sqrt(a^2+b^2). Here, if magnitude of a or b is too small/large, then underflow/overflow can happen.

A common approach to solve this is to divide numbers by the largest magnitude number. However, this solution is:

  • slow (division is slow)
  • causes a little extra inaccuracy

So I thought that instead of dividing by the largest magnitude number, let's multiply it with a close power-of-2 reciprocal number. This seems a better solution, as:

  • multiplication is much faster than division
  • better accuracy, as multiplying with a power-of-2 number is exact

So, I'd like to create a small utility function, which has a logic like this (by ^, I mean exponentiation):

void getScaler(double value, double &scaler, double &scalerReciprocal) {
    int e = <exponent of value>;
    if (e<-1022) { scaler=2^-1022; scalerReciprocal = 2^1022; }
    } else if (e>1022) { scaler=2^1022; scalerReciprocal = 2^-1022; }
    } else { scaler=2^e; scalerReciprocal = 2^(2046-e); }
}

This function should return a normalized scaler & scalerReciprocal, both are power-of-2 numbers, where scaler is near to value, and scalerReciprocal is the reciprocal of scaler.

The maximum allowed exponents for scaler/scaleReciprocal are -1022..1022 (I don't want to work with subnormal scaler, as subnormal numbers can be slow).

What would be a fast way to do this? Can this be done with pure floating-point operations? Or should I extract the exponent from value, and use simple ifs to do the logic? Is there some kind of trick to do the comparison with (-)1022 fast (as the range is symmetric)?

Note: scaler doesn't need to be the closest possible power-of-2. If some logic needs it, scaler can be some small power-of-2 away from the closest value.


Solution

  • Based on wim's answer, here's another solution, which can be faster, as it has one less instruction. The output is a little bit different, but still fulfills the requirements.

    The idea is to use bit operations to fix border cases: put a 01 to the lsb of the exponent, no matter of its value. So, exponent:

    • 0 becomes 1 (-1023 becomes -1022)
    • 2046 becomes 2045 (1023 becomes 1022)
    • other exponents modified as well, but just slightly: the number can become two times larger compared to wim's solution (when exponent lsb changes from 00 to 01), or halved (when 10->01) or 1/4 (when 11->01)

    So, this modified routine works (and I think that it's pretty cool that the problem can be solved with only 2 fast asm instructions):

    #include<stdio.h>
    #include<stdint.h>
    #include<immintrin.h>
    /* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */
    
    union dbl_int64{
        double d;
        uint64_t i;
    };
    
    double get_scale(double t){
        union dbl_int64 x;
        uint64_t and_i;
        uint64_t or_i;
             /* 0xFEDCBA9876543210 */
        and_i = 0x7FD0000000000000ull;
        or_i =  0x0010000000000000ull;
        x.d = t;
        x.i = (x.i & and_i)|or_i;                     /* Set fraction bits to zero, take absolute value */
        return x.d;
    }
    
    double get_scale_x86(double t){
        __m128d x = _mm_set_sd(t);
        __m128d x_and = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
        __m128d x_or  = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
                x     = _mm_and_pd(x, x_and);
                x     = _mm_or_pd(x, x_or);
        return _mm_cvtsd_f64(x);
    }
    
    /* Compute the inverse 1/t of a double t with all zero fraction bits     */
    /* and exponent between the limits of function get_scale                 */
    /* A single integer subtraction is much less expensive than a            */
    /* floating point division.                                               */
    double inv_of_scale(double t){
        union dbl_int64 x;
                         /* 0xFEDCBA9876543210 */
        uint64_t inv_mask = 0x7FE0000000000000ull;
        x.d = t;
        x.i = inv_mask - x.i;
        return x.d;
    }
    
    double inv_of_scale_x86(double t){
        __m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
        __m128d x        = _mm_set_sd(t);
        __m128i x_i      = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
        return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
    }
    
    
    int main(){
        int n = 14;
        int i;
        /* Several example values, 4.94e-324 is the smallest subnormal */
        double y[14] = { 4.94e-324, 1.1e-320,  1.1e-300,  1.1e-5,  0.7,  1.7,  123.1, 1.1e300,  
                         1.79e308, -1.1e-320,    -0.7, -1.7, -123.1,  -1.1e307};
        double z, s, u;
    
        printf("Portable code:\n");
        printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
        for (i = 0; i < n; i++){  
            z = y[i];
            s = get_scale(z);
            u = inv_of_scale(s);
            printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
        }
    
        printf("\nx86 specific SSE code:\n");
        printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
        for (i = 0; i < n; i++){  
            z = y[i];
            s = get_scale_x86(z);
            u = inv_of_scale_x86(s);
            printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
        }
    
        return 0;
    }