Search code examples
logarithmpolynomialsfixed-point

Log2 approximation in fixed-point


I'v already implemented fixed-point log2 function using lookup table and low-order polynomial approximation but not quite happy with accuracy across the entire 32-bit fixed-point range [-1,+1). The input format is s0.31 and the output format is s15.16.

I'm posting this question here so that another user can post his answer (some comments were exchanged in another thread but they prefer to provide comprehensive answer in a separate thread). Any other answers are welcome, I would much appreciate if you could provide some speed vs accuracy details of your algorithm and its implementation.

Thanks.


Solution

  • By simply counting the leading zero bits in a fixed-point number x, one can determine log2(x) to the closest strictly smaller integer. On many processor architectures, there is a "count leading zeros" machine instruction or intrinsic. Where this is not available, a fairly efficient implementation of clz() can be constructed in a variety of ways, one of which is included in the code below.

    To compute the fractional part of the logarithm, the two main obvious contenders are interpolation in a table and minimax polynomial approximation. In this specific case, quadratic interpolation in a fairly small table seems to be the more attractive option. x = 2i * (1+f), with 0 ≤ f < 1. We determine i as described above and use the leading bits of f to index into the table. A parabola is fit through this and two following table entries, computing the parameters of the parabola on the fly. The result is rounded, and a heuristic adjustment is applied to partially compensate for the truncating nature of fixed-point arithmetic. Finally, the integer portion is added, yielding the final result.

    It should be noted that the computation involves right shifts of signed integers which may be negative. We need those right shifts to map to arithmetic right shifts at machine code level, something which is not guaranteed by the ISO-C standard. However, in practice most compilers do what is desired. In this case I used the Intel compiler on an x64 platform running Windows.

    With a 66-entry table of 32-bit words, the maximum absolute error can be reduced to 8.18251e-6, so full s15.16 accuracy is achieved.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <math.h>
    
    #define FRAC_BITS_OUT (16)
    #define INT_BITS_OUT  (15)
    #define FRAC_BITS_IN  (31)
    #define INT_BITS_IN   ( 0)
    
    /* count leading zeros: intrinsic or machine instruction on many architectures */
    int32_t clz (uint32_t x)
    {
        uint32_t n, y;
    
        n = 31 + (!x);
        if ((y = (x & 0xffff0000U))) { n -= 16;  x = y; }
        if ((y = (x & 0xff00ff00U))) { n -=  8;  x = y; }
        if ((y = (x & 0xf0f0f0f0U))) { n -=  4;  x = y; }
        if ((y = (x & 0xccccccccU))) { n -=  2;  x = y; }
        if ((    (x & 0xaaaaaaaaU))) { n -=  1;         }
        return n;
    }
    
    #define LOG2_TBL_SIZE (6)
    #define TBL_SIZE      ((1 << LOG2_TBL_SIZE) + 2)
    
    /* for i = [0,65]: log2(1 + i/64) * (1 << 31) */
    const uint32_t log2Tab [TBL_SIZE] =
    {
        0x00000000, 0x02dcf2d1, 0x05aeb4dd, 0x08759c50, 
        0x0b31fb7d, 0x0de42120, 0x108c588d, 0x132ae9e2, 
        0x15c01a3a, 0x184c2bd0, 0x1acf5e2e, 0x1d49ee4c, 
        0x1fbc16b9, 0x22260fb6, 0x24880f56, 0x26e2499d, 
        0x2934f098, 0x2b803474, 0x2dc4439b, 0x30014ac6, 
        0x32377512, 0x3466ec15, 0x368fd7ee, 0x38b25f5a, 
        0x3acea7c0, 0x3ce4d544, 0x3ef50ad2, 0x40ff6a2e, 
        0x43041403, 0x450327eb, 0x46fcc47a, 0x48f10751, 
        0x4ae00d1d, 0x4cc9f1ab, 0x4eaecfeb, 0x508ec1fa, 
        0x5269e12f, 0x5440461c, 0x5612089a, 0x57df3fd0, 
        0x59a80239, 0x5b6c65aa, 0x5d2c7f59, 0x5ee863e5, 
        0x60a02757, 0x6253dd2c, 0x64039858, 0x65af6b4b, 
        0x675767f5, 0x68fb9fce, 0x6a9c23d6, 0x6c39049b, 
        0x6dd2523d, 0x6f681c73, 0x70fa728c, 0x72896373, 
        0x7414fdb5, 0x759d4f81, 0x772266ad, 0x78a450b8, 
        0x7a231ace, 0x7b9ed1c7, 0x7d17822f, 0x7e8d3846, 
        0x80000000, 0x816fe50b
    };
    
    #define RND_SHIFT     (31 - FRAC_BITS_OUT)
    #define RND_CONST     ((1 << RND_SHIFT) / 2)
    #define RND_ADJUST    (0x10d) /* established heuristically */
    
    /* 
       compute log2(x) in s15.16 format, where x is in s0.31 format
       maximum absolute error 8.18251e-6 @ 0x20352845 (0.251622232)
    */   
    int32_t fixed_log2 (int32_t x)
    {
        int32_t f1, f2, dx, a, b, approx, lz, i, idx;
        uint32_t t;
    
        /* x = 2**i * (1 + f), 0 <= f < 1. Find i */
        lz = clz (x);
        i = INT_BITS_IN - lz;
        /* normalize f */
        t = (uint32_t)x << (lz + 1);
        /* index table of log2 values using LOG2_TBL_SIZE msbs of fraction */
        idx = t >> (32 - LOG2_TBL_SIZE);
        /* difference between argument and smallest sampling point */
        dx = t - (idx << (32 - LOG2_TBL_SIZE));
        /* fit parabola through closest three sampling points; find coeffs a, b */
        f1 = (log2Tab[idx+1] - log2Tab[idx]);
        f2 = (log2Tab[idx+2] - log2Tab[idx]);
        a = f2 - (f1 << 1);
        b = (f1 << 1) - a;
        /* find function value for argument by computing ((a*dx+b)*dx) */
        approx = (int32_t)((((int64_t)a)*dx) >> (32 - LOG2_TBL_SIZE)) + b;
        approx = (int32_t)((((int64_t)approx)*dx) >> (32 - LOG2_TBL_SIZE + 1));
        approx = log2Tab[idx] + approx;
        /* round fractional part of result */
        approx = (((uint32_t)approx) + RND_CONST + RND_ADJUST) >> RND_SHIFT;
        /* combine integer and fractional parts of result */
        return (i << FRAC_BITS_OUT) + approx;
    }
    
    /* convert from s15.16 fixed point to double-precision floating point */
    double fixed_to_float_s15_16 (int32_t a)
    {
        return a / 65536.0;
    }
    
    /* convert from s0.31 fixed point to double-precision floating point */
    double fixed_to_float_s0_31 (int32_t a)
    {
        return a / (65536.0 * 32768.0);
    }
    
    int main (void)
    {
        double a, res, ref, err, maxerr = 0.0;
        int32_t x, start, end;
    
        start = 0x00000001;
        end =   0x7fffffff;
        printf ("testing fixed_log2 with inputs in [%17.10e, %17.10e)\n",  
                fixed_to_float_s0_31 (start), fixed_to_float_s0_31 (end));
    
        for (x = start; x < end; x++) {
            a = fixed_to_float_s0_31 (x);
            ref = log2 (a);
            res = fixed_to_float_s15_16 (fixed_log2 (x));
            err = fabs (res - ref);
            if (err > maxerr) {
                maxerr = err;
            }
        }
    
        printf ("max. err = %g\n", maxerr);
        return EXIT_SUCCESS;
    }
    

    For completeness, I am showing the minimax polynomial approximation below. The coefficients for such approximations can be generated by several tools such as Maple, Mathematica, Sollya or with homebrew code using the Remez algorithm, which is what I used here. The code below shows the original floating-point coefficients, the dynamic scaling used to maximize accuracy in intermediate computation, and the heuristic adjustments applied to mitigate the impact of non-rounding fixed-point arithmetic.

    A typical approach for computation of log2(x) is to use x = 2i * (1+f) and use approximation of log2(1+f) for (1+f) in [√½, √2], which means that we use a polynomial p(f) on the primary approximation interval [√½-1, √2-1].

    The intermediate computation scales up operands as far as feasible for improved accuracy under the restriction that we want to use a 32-bit mulhi operation as its basic building block, as this is a native instruction on many 32-bit architectures, accessible either via inline machine code or as an intrinsic. As in the table-based code, there are right shifts of signed data which may be negative, and such right shifts must map to arithmetic right shifts, something that ISO-C doesn't guarantee but most C compilers do.

    I managed to get the maximum absolute error for this variant down to 1.11288e-5, so almost full s15.16 accuracy but slightly worse than for the table-based variant. I suspect I should have added one additional term to the polynomial.

    /* on 32-bit architectures, there is often an instruction/intrinsic for this */
    int32_t mulhi (int32_t a, int32_t b)
    {
        return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
    }
    
    #define RND_SHIFT  (25 - FRAC_BITS_OUT)
    #define RND_CONST  ((1 << RND_SHIFT) / 2)
    #define RND_ADJUST (-2) /* established heuristically */
    
    /* 
        compute log2(x) in s15.16 format, where x is in s0.31 format
        maximum absolute error 1.11288e-5 @ 0x5a82689f (0.707104757)
    */   
    int32_t fixed_log2 (int32_t x)
    {
        int32_t lz, i, f, p, approx;
        uint32_t t;
        /* x = 2**i * (1 + f), 0 <= f < 1. Find i */
        lz = clz (x);
        i = INT_BITS_IN - lz;
        /* force (1+f) into range [sqrt(0.5), sqrt(2)] */
        t = (uint32_t)x << lz;    
        if (t > (uint32_t)(1.414213562 * (1U << 31))) {
            i++;
            t = t >> 1;
        }
        /* compute log2(1+f) for f in [-0.2929, 0.4142] */
        f = t - (1U << 31);
        p =              + (int32_t)(-0.206191055 * (1U << 31) -  1);
        p = mulhi (p, f) + (int32_t)( 0.318199910 * (1U << 30) - 18);
        p = mulhi (p, f) + (int32_t)(-0.366491705 * (1U << 29) + 22);
        p = mulhi (p, f) + (int32_t)( 0.479811855 * (1U << 28) -  2);
        p = mulhi (p, f) + (int32_t)(-0.721206390 * (1U << 27) + 37);
        p = mulhi (p, f) + (int32_t)( 0.442701618 * (1U << 26) + 35);
        p = mulhi (p, f) + (f >> (31 - 25));
        /* round fractional part of the result */
        approx = (p + RND_CONST + RND_ADJUST) >> RND_SHIFT;
        /* combine integer and fractional parts of result */
        return (i << FRAC_BITS_OUT) + approx;
    }