Search code examples
c++ieee-754

Round a floating point value to the nearest power of two


While there are many answers to the question of how to find the next power of two of an integer or floating-point number, there aren't as many for finding the nearest power of two of a number.

I already implemented the following :

template <typename T>
static constexpr T round_pow2(T v) {
    if constexpr (std::is_floating_point_v<T>) {
        auto high = static_cast<unsigned long>(std::ceil(v));
        auto low  = static_cast<unsigned long>(std::floor(v));

        if (high == low) {
            return round_pow2<unsigned long>(high);
        } else {
            T a = static_cast<T>(round_pow2<unsigned long>(low));
            T b = static_cast<T>(round_pow2<unsigned long>(high));

            return std::abs(a - v) <= std::abs(b - v) ? a : b;
        }
    } else {
        T high = v - 1;

        for (T i = 1; i < static_cast<T>(sizeof(T)); i *= 2) {
            high |= high >> i;
        }

        high += 1;
        T low = high >> 1;

        return (high - v) < (v - low) ? high : low;
    }
}

Which should work for any non negative and lower than ULLONG_MAX fp value. But this doesn't seem to be optimal. Is there a better (more performant) way of implementing this function ?

Edits: @Eric for my application getting 0 is fine in case v == 0. But this could indeed be a problem to some because 0 is not a power of 2.

@Blixodus Thank you for your answer, it pointed me in the right direction. I created the following function with your idea in mind:


template <typename T>
constexpr T round_p2(T v) {
    if constexpr (std::is_floating_point_v<T>) {
        using R = std::conditional_t<
            std::is_same_v<T, double>, uint64_t,
            std::conditional_t<std::is_same_v<T, float>, uint32_t,
            void
        >>;

        auto [mlen, es, em] = std::is_same_v<T, double> ? std::make_tuple(52, 1024, 0x7FF) : std::make_tuple(23, 128, 0xFF);
        auto y = *reinterpret_cast<R*>(&v);
        return (T(y >> (sizeof(R) * 8 - 1)) * -2 + 1) * (2 << (((y >> mlen) & em) - es + ((y >> mlen - 1) & 0x1)));
    } else {
        using R = std::make_unsigned_t<T>;
        R rv = static_cast<R>(v);
        T sign = 1;

        if constexpr (std::is_signed_v<T>) {
            if (v < 0) {
                rv = static_cast<R>(-v);
                sign = -1;
            }
        }

        R high = rv - 1;

        for (R i = 1; i < static_cast<R>(sizeof(R)); i *= 2) {
            high |= high >> i;
        }

        high += 1;
        R low = high >> 1;

        return sign * static_cast<T>((high - rv) <= (rv - low) ? high : low);
    }
}

It seems to works quite well for my application and generates really nice assembly compared to my first implementation.

Explanation: I am first getting three magic values depending on if v is a float or a double: the first is the length of the mantissa, the second is the amount the exponent need to be decremented by (plus one), and the third is the mask used to extract the exponent from the fp representation.

I then cast the fp value into an unsigned integer of the same size to be able to work on it.

Next, I extract the value which will sign the final result using (T(y >> (sizeof(R) * 8 - 1)) * -2 + 1). This extracts the highest bit of the fp value (the sign, 0 for positive and 1 for negative), and then apply the function f(x) = x * -2 + 1 to it, which gives 1 for x=0 and -1 for x=1.

Eventually, I compute the closest unsigned power of two of the given fp value using Blixodus' formula. Because I don't use the std::pow function in favor of a bit shift (because we're working with powers of two). I need to account for this by removing one to the value we're shifting 2 with (hence the value of es being 1 more than expected).


Solution

  • Prior to C++ 2020, there is no implementation for rounding integers to a power of two that does not use looping, a compiler extension, or bit-shifting that presumes some limit on the width of the type. There are several Stack Overflow questions about that. The code below shows a solution using the C++ std::countl_zero and an alternate solution using GCC’s count-leading-zeros builtin, which will work for any integer type up to unsigned long long.

    #include <cmath>
    
    #if 201703L <= __cplusplus
        #include <bit>
        #include <type_traits>
    #endif
    
    
    template <typename T> static constexpr T round_pow2(T v)
    {
        /*  Since one is the smallest power of two, all numbers less than or equal
            to one round to one.
        */
        if (v <= 1) return 1;
    
        /*  For floating-point, the standard frexp function gives us the fraction
            and exponent, and ldexp applies an exponent.  The fraction is scaled to
            [.5, 1), so, if it is less than or equal to .75, we round down.
        */
        if constexpr (std::is_floating_point_v<T>)
        {
            int exponent = 0;
            T fraction = frexp(v, &exponent);
            return ldexp(.5, exponent + (.75 < fraction));
        }
    
        /*  Here we handle integer types.  The midpoints for rounding to powers of
            two are at 3*2^n.  That is, the transition between rounding to one
            power of two and another occurs at a number that has the form 3*2^n.
            To find which interval v is in, we can divide it by three and then
            find the next lower (instead of nearest) power of two.  To get the
            desired rounding at the midpoint, we use v-1.  So the general algorithm
            is to round (v-1)/3 down to the nearest power of two, then quadruple
            that.  For example:
    
                v   v-1   (v-1)/3   rounded down   quadrupled
                11   10     3            2             8
                12   11     3            2             8
                13   12     4            4            16
    
            Note that (v-1)/3 is not quite right for v=2, as the subtraction of 1
            jumps a full power of two, from 2 to 1.  (v-.01)/3 would work, but we
            want to stick with integer arithmetic.
    
            For the general case, we want 4 * 2**floor(log2((v-1)/3)).  To include
            v=2, we will use 2 * 2**f((v-1)/3), where f(x) is floor(log2(x))+1 but
            clamped to produce at least zero.
    
            If the C++ 2020 std::countl_zero function is available, we use that.
            Otherwise, we use the GCC builtin __builtin_clzll.  In either case, the
            function returns the number of leading zero bits, which depends on the
            width of the type rather than the operand value alone.  To calculate
            the power of two, we get the bit count for a fixed value (zero or one)
            as a reference point.
        */
        else
        {
            #if __cpp_lib_bitops
                /*  std::countl_zero is only provided for unsigned types, so
                    define UT to be the unsigned type corresponding to T.
                */
                using UT  = std::make_unsigned<T>::type;
    
                return static_cast<UT>(2) << std::countl_zero<UT>(0) - std::countl_zero<UT>((v-1)/3));
            #else
                /*  Since __builtin_clzll is not defined for zero operands, we need
                    to ensure its operand is at least 1.  To do this, we change
                    (v-1)/3 to (v-1)/3*2+1.  The doubling increases the power of
                    two by one, so we change the reference point from zero to one,
                    decreasing the number of bits for it by one.
                */
                return 2ull << __builtin_clzll(1) - __builtin_clzll((v-1)/3*2+1);
            #endif
        }
    }