Search code examples
cfloating-pointprecisionapproximationsingle-precision

Single precision argument reduction for trigonometric functions in C


I have implemented some approximations for trigonometric functions (sin,cos,arctan) computed with single precision (32 bit floating point) in C. They are accurate to about +/- 2 ulp.

My target device does not support any <cmath> or <math.h> methods. It does not provide a FMA, but a MAC ALU. ALU and LU compute in 32 bit format.

My arctan approximation is actually a modified version of the approximation of N.juffa, which approximates arctan on the full range. Sine and cosine function are accurate up to 2 ulp within the range [-pi,pi].

I am now aiming to provide a larger input range (as large as possible, ideally [FLT_MIN,FLT_MAX]) for sine and cosine, which leads me to argument reduction.

I'm currently reading different papers like ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit by K.C.Ng or the paper about this new argument reduction algorithm, but I wasn't able to derive an implementation from it.

Also I want to mention two stackoverflow questions that refer to related problems: There is a approach with matlab and c++ which is based on the first paper I linked. It is actually using matlab, cmath methods and it limits the input to [0,20.000]. The other one was already mentioned in the comments. It is an approach to an implementation of sin and cos in C, using various c-libraries which are not available for me. Since both posts are already several years old, there might be some new findings.

It seems like the algorithm mostly used in this case is to store the number of 2/pi accurate up to the needed number of bits, to be able to compute the modulo calculation accurately and simultaneously avoid cancellation. My device does not provide a large DMEM, which means large look-up tables with hundreds of bits are not possible. This procedure is actually described on page 70 of this reference, which by the way provides a lot of useful informatin about floating point math.

So my question is: Is there another efficient way to reduce the arguments for sine and cosine obtaining single precision avoiding large LUTs? The papers mentioned above actually focus on double precision and use up to 1000 digits, which is not suitable for my usecase.

I actually haven't found any implementation in C nor an implementation aiming single precision calculation, I would be grateful for any sorts of hints /links /examples...


Solution

  • The following code is based on a previous answer in which I demonstrated how to perform a fairly accurate argument reduction for trigonometric functions by using the Cody-Waite method of split constants for arguments small in magnitude, and the Payne-Hanek method for arguments large in magnitude. For details on the Payne-Hanek algorithm see there, for details on the Cody-Waite algorithm see this previous answer of mine.

    Here I have made adjustments necessary to adjust to the restrictions of the asker's platform, in that no 64-bit types are supported, fused multiply-add is not supported, and helper functions from math.h are not available. I am assuming that float maps to IEEE-754 binary32 format, and that there is a way to re-interpret such a 32-bit float as a 32-bit unsigned integer and vice versa. I have implemented this re-interpretation via the standard portable idiom, that is, by using memcpy(), but other methods may be chosen appropriate for the unspecified target platform, such as inline assembly, machine-specific intrinsics, or volatile unions.

    Since this code is basically a port of my previous code to a more restrictive environment, it lacks perhaps the elegance of a de novo design specifically targeted at that environment. I have basically replaced the frexp() helper function from math.h with some bit twiddling, emulated 64-bit integer computation with pairs of 32-bit integers, replaced the double-precision computation with 32-bit fixed-point computation (which worked much better than I had anticipated), and replaced all FMAs with the unfused equivalent.

    Re-working the Cody-Waite portion of the argument reduction took quite a bit of work. Clearly, without FMA available, we need to ensure a sufficient number of trailing zero bits in the constituent parts of the constant π/2 (except the least significant one) to make sure the products are exact. I spent several hours experimentally puzzling out a particular split that delivers accurate results but also pushes the switchover point to the Payne-Hanek method as high as possible.

    When USE_FMA = 1 is specified, the output of the test app, when compiled with a high-quality math library, should look similar to this:

    Testing sinf ...  PASSED. max ulp err = 1.493253  diffsum = 337633490
    Testing cosf ...  PASSED. max ulp err = 1.495098  diffsum = 342020968
    

    With USE_FMA = 0 the accuracy changes slightly for the worse:

    Testing sinf ...  PASSED. max ulp err = 1.498012  diffsum = 359702532
    Testing cosf ...  PASSED. max ulp err = 1.504061  diffsum = 364682650
    

    The diffsum output is a rough indicator of overall accuracy, here showing that about 90% of all inputs result in a correctly rounded single-precision response.

    Note that it is important to compile the code with the strictest floating-point settings and highest degree of adherence to IEEE-754 the compiler offers. For the Intel compiler that I used to develop and test this code, that can be achieved by compiling with /fp:strict. Also, the quality of the math library used for reference is crucial for accurate assessment of the ulp error of this single-precision code. The Intel compiler comes with a math library that provides double-precision elementary math functions with just slightly over 0.5 ulp error in the HA (high accuracy) variant. Use of a multi-precision reference library may be preferable but would have slowed me down too much here.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>   // for memcpy()
    #include <math.h>     // for test purposes, and when PORTABLE=1 or USE_FMA=1
    
    #define USE_FMA   (0) // use fmaf() calls for arithmetic
    #define PORTABLE  (0) // allow helper functions from math.h
    #define HAVE_U64  (0) // 64-bit integer type available
    #define CW_STAGES (3) // number of stages in Cody-Waite reduction when USE_FMA=0
    
    #if USE_FMA
    #define SIN_RED_SWITCHOVER  (117435.992f)
    #define COS_RED_SWITCHOVER  (71476.0625f)
    #define MAX_DIFF            (1)
    #else // USE_FMA
    #if CW_STAGES == 2
    #define SIN_RED_SWITCHOVER  (3.921875f)
    #define COS_RED_SWITCHOVER  (3.921875f)
    #elif CW_STAGES == 3
    #define SIN_RED_SWITCHOVER  (201.15625f)
    #define COS_RED_SWITCHOVER  (142.90625f)
    #endif // CW_STAGES
    #define MAX_DIFF            (2)
    #endif // USE_FMA
    
    /* re-interpret the bit pattern of an IEEE-754 float as a uint32 */
    uint32_t float_as_uint32 (float a)
    {
        uint32_t r;
        memcpy (&r, &a, sizeof r);
        return r;
    }
    
    /* re-interpret the bit pattern of a uint32 as an IEEE-754 float */
    float uint32_as_float (uint32_t a)
    {
        float r;
        memcpy (&r, &a, sizeof r);
        return r;
    }
    
    /* Compute the upper 32 bits of the product of two unsigned 32-bit integers */
    #if HAVE_U64
    uint32_t umul32_hi (uint32_t a, uint32_t b)
    {
        return (uint32_t)(((uint64_t)a * b) >> 32);
    }
    #else // HAVE_U64
    /* Henry S. Warren, "Hacker's Delight, 2nd ed.", Addison-Wesley 2012. Fig. 8-2 */
    uint32_t umul32_hi (uint32_t a, uint32_t b)
    {
        uint16_t a_lo = (uint16_t)a;
        uint16_t a_hi = a >> 16;
        uint16_t b_lo = (uint16_t)b;
        uint16_t b_hi = b >> 16;
        uint32_t p0 = (uint32_t)a_lo * b_lo;
        uint32_t p1 = (uint32_t)a_lo * b_hi;
        uint32_t p2 = (uint32_t)a_hi * b_lo;
        uint32_t p3 = (uint32_t)a_hi * b_hi;
        uint32_t t = (p0 >> 16) + p1;
        return (t >> 16) + (((uint32_t)(uint16_t)t + p2) >> 16) + p3;
    }
    #endif // HAVE_U64
    
    /* 190 bits of 2/PI for Payne-Hanek style argument reduction. */
    const uint32_t two_over_pi_f [] = 
    {
        0x28be60db,
        0x9391054a,
        0x7f09d5f4,
        0x7d4d3770,
        0x36d8a566,
        0x4f10e410
    };
    
    /* Reduce a trig function argument using the slow Payne-Hanek method */
    float trig_red_slowpath_f (float a, int *quadrant)
    {
        uint32_t ia, hi, mid, lo, tmp, i, l, h, plo, phi;
        int32_t e, q;
        float r;
    
    #if PORTABLE
        ia = (uint32_t)(fabsf (frexpf (a, &e)) * 4.29496730e+9f); // 0x1.0p32
    #else // PORTABLE
        ia = ((float_as_uint32 (a) & 0x007fffff) << 8) | 0x80000000;
        e = ((float_as_uint32 (a) >> 23) & 0xff) - 126;
    #endif // PORTABLE
        
        /* compute product x * 2/pi in 2.62 fixed-point format */
        i = (uint32_t)e >> 5;
        e = (uint32_t)e & 31;
    
        hi  = i ? two_over_pi_f [i-1] : 0;
        mid = two_over_pi_f [i+0];
        lo  = two_over_pi_f [i+1];
        tmp = two_over_pi_f [i+2];
     
        if (e) {
            hi  = (hi  << e) | (mid >> (32 - e));
            mid = (mid << e) | (lo  >> (32 - e));
            lo  = (lo  << e) | (tmp >> (32 - e));
        }
    
        /* compute 64-bit product phi:plo */
        phi = 0;
        l = ia * lo;
        h = umul32_hi (ia, lo);
        plo = phi + l;
        phi = h + (plo < l);
        l = ia * mid;
        h = umul32_hi (ia, mid);
        plo = phi + l;
        phi = h + (plo < l);
        l = ia * hi;
        phi = phi + l;
    
        /* split fixed-point result into integer and fraction portions */
        q = phi >> 30;               // integral portion = quadrant<1:0>
        phi = phi & 0x3fffffff;      // fraction
        if (phi & 0x20000000) {      // fraction >= 0.5
            phi = phi - 0x40000000;  // fraction - 1.0
            q = q + 1;
        }
    
        /* compute remainder of x / (pi/2) */
    #if USE_FMA
        float phif, plof, chif, clof, thif, tlof;
        phif = 1.34217728e+8f * (float)(int32_t)(phi & 0xffffffe0); // 0x1.0p27
        plof = (float)((plo >> 5) | (phi << (32-5)));
        thif = phif + plof;
        plof = (phif - thif) + plof;
        phif = thif;
        chif =  1.08995894e-17f; //  0x1.921fb6p-57 // (1.5707963267948966 * 0x1.0p-57)_hi 
        clof = -3.03308686e-25f; // -0x1.777a5cp-82 // (1.5707963267948966 * 0x1.0p-57)_lo
        thif = phif * chif;
        tlof = fmaf (phif, chif, -thif);
        tlof = fmaf (phif, clof, tlof);
        tlof = fmaf (plof, chif, tlof);
        r = thif + tlof;
    #else // USE_FMA
        /* record sign of fraction */
        uint32_t s = phi & 0x80000000;
        
        /* take absolute value of fraction */
        if ((int32_t)phi < 0) {
            phi = ~phi;
            plo = 0 - plo;
            phi += (plo == 0);
        }
        
        /* normalize fraction */
        e = 0;
        while ((int32_t)phi > 0) {
            phi = (phi << 1) | (plo >> 31);
            plo = plo << 1;
            e--;
        }
        
        /* multiply 32 high-order bits of fraction with pi/2 */
        phi = umul32_hi (phi, 0xc90fdaa2); // (uint32_t)rint(PI/2 * 2**31)
        
        /* normalize product */
        if ((int32_t)phi > 0) {
            phi = phi << 1;
            e--;
        }
    
        /* round and convert to floating point */
        uint32_t ri = s + ((e + 128) << 23) + (phi >> 8) + ((phi & 0xff) > 0x7e);
        r = uint32_as_float (ri);
    #endif // USE_FMA
        if (a < 0.0f) {
            r = -r;
            q = -q;
        }
    
        *quadrant = q;
        return r;
    }
    
    /* Argument reduction for trigonometric functions that reduces the argument
       to the interval [-PI/4, +PI/4] and also returns the quadrant. It returns 
       -0.0f for an input of -0.0f 
    */
    float trig_red_f (float a, float switch_over, int *q)
    {    
        float j, r;
    
        if (fabsf (a) > switch_over) {
            /* Payne-Hanek style reduction. M. Payne and R. Hanek, "Radian reduction
               for trigonometric functions". SIGNUM Newsletter, 18:19-24, 1983
            */
            r = trig_red_slowpath_f (a, q);
        } else {
            /* Cody-Waite style reduction. W. J. Cody and W. Waite, "Software Manual
               for the Elementary Functions", Prentice-Hall 1980
            */
    #if USE_FMA
            j = fmaf (a, 6.36619747e-1f, 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
            j = j - 1.25829120e+7f; // 0x1.8p+23
            r = fmaf (j, -1.57079601e+00f, a); // -0x1.921fb0p+00 // pio2_high
            r = fmaf (j, -3.13916473e-07f, r); // -0x1.5110b4p-22 // pio2_mid
            r = fmaf (j, -5.39030253e-15f, r); // -0x1.846988p-48 // pio2_low
    #else // USE_FMA
            j = (a * 6.36619747e-1f + 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
            j = j - 1.25829120e+7f; // 0x1.8p+23
    #if CW_STAGES == 2
            r = a - j * 1.57079625e+00f; // 0x1.921fb4p+0  // pio2_high 
            r = r - j * 7.54979013e-08f; // 0x1.4442d2p-24 // pio2_low
    #elif CW_STAGES == 3
            r = a - j * 1.57078552e+00f; // 0x1.921f00p+00 // pio2_high
            r = r - j * 1.08043314e-05f; // 0x1.6a8880p-17 // pio2_mid
            r = r - j * 2.56334407e-12f; // 0x1.68c234p-39 // pio2_low
    #endif // CW_STAGES
    #endif // USE_FMA
            *q = (int)j;
        }
        return r;
    }
    
    /* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 0.64196
       Returns -0.0f for an argument of -0.0f
       Polynomial approximation based on T. Myklebust, "Computing accurate 
       Horner form approximations to special functions in finite precision
       arithmetic", http://arxiv.org/abs/1508.03211, retrieved on 8/29/2016
    */
    float sinf_poly (float a, float s)
    {
        float r, t;
    #if USE_FMA
        r =              2.86567956e-6f;  //  0x1.80a000p-19 
        r = fmaf (r, s, -1.98559923e-4f); // -0x1.a0690cp-13
        r = fmaf (r, s,  8.33338592e-3f); //  0x1.111182p-07
        r = fmaf (r, s, -1.66666672e-1f); // -0x1.555556p-03
        t = fmaf (a, s, 0.0f); // ensure -0 is passed through
        r = fmaf (r, t, a);
    #else // USE_FMA
        r =         2.86567956e-6f; //  0x1.80a000p-19
        r = r * s - 1.98559923e-4f; // -0x1.a0690cp-13
        r = r * s + 8.33338592e-3f; //  0x1.111182p-07
        r = r * s - 1.66666672e-1f; // -0x1.555556p-03
        t = a * s + 0.0f; // ensure -0 is passed through
        r = r * t + a;
    #endif // USE_FMA
        return r;
    }
    
    /* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 0.87444 */
    float cosf_poly (float s)
    {
        float r;
    #if USE_FMA
        r =              2.44677067e-5f;  //  0x1.9a8000p-16
        r = fmaf (r, s, -1.38877297e-3f); // -0x1.6c0efap-10
        r = fmaf (r, s,  4.16666567e-2f); //  0x1.555550p-05
        r = fmaf (r, s, -5.00000000e-1f); // -0x1.000000p-01
        r = fmaf (r, s,  1.00000000e+0f); //  0x1.000000p+00
    #else // USE_FMA
        r =         2.44677067e-5f; //  0x1.9a8000p-16
        r = r * s - 1.38877297e-3f; // -0x1.6c0efap-10
        r = r * s + 4.16666567e-2f; //  0x1.555550p-05
        r = r * s - 5.00000000e-1f; // -0x1.000000p-01
        r = r * s + 1.00000000e+0f; //  0x1.000000p+00
    #endif // USE_FMA
        return r;
    }
    
    /* Map sine or cosine value based on quadrant */
    float sinf_cosf_core (float a, int i)
    {
        float r, s;
    
        s = a * a;
        r = (i & 1) ? cosf_poly (s) : sinf_poly (a, s);
        if (i & 2) {
            r = 0.0f - r; // don't change "sign" of NaNs
        }
        return r;
    }
    
    /* maximum ulp error with USE_FMA = 1: 1.495098  */
    float my_sinf (float a)
    {
        float r;
        int i;
    
        a = a * 0.0f + a; // inf -> NaN
        r = trig_red_f (a, SIN_RED_SWITCHOVER, &i);
        r = sinf_cosf_core (r, i);
        return r;
    }
    
    /* maximum ulp error with USE_FMA = 1: 1.493253 */
    float my_cosf (float a)
    {
        float r;
        int i;
    
        a = a * 0.0f + a; // inf -> NaN
        r = trig_red_f (a, COS_RED_SWITCHOVER, &i);
        r = sinf_cosf_core (r, i + 1);
        return r;
    }
    
    /* re-interpret bit pattern of an IEEE-754 double as a uint64 */
    uint64_t double_as_uint64 (double a)
    {
        uint64_t r;
        memcpy (&r, &a, sizeof r);
        return r;
    }
    
    double floatUlpErr (float res, double ref)
    {
        uint64_t i, j, err, refi;
        int expoRef;
        
        /* ulp error cannot be computed if either operand is NaN, infinity, zero */
        if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
            (res == 0.0f) || (ref == 0.0f)) {
            return 0.0;
        }
        /* Convert the float result to an "extended float". This is like a float
           with 56 instead of 24 effective mantissa bits.
        */
        i = ((uint64_t)float_as_uint32(res)) << 32;
        /* Convert the double reference to an "extended float". If the reference is
           >= 2^129, we need to clamp to the maximum "extended float". If reference
           is < 2^-126, we need to denormalize because of the float types's limited
           exponent range.
        */
        refi = double_as_uint64(ref);
        expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
        if (expoRef >= 129) {
            j = 0x7fffffffffffffffULL;
        } else if (expoRef < -126) {
            j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
            j = j >> (-(expoRef + 126));
        } else {
            j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
            j = j | ((uint64_t)(expoRef + 127) << 55);
        }
        j = j | (refi & 0x8000000000000000ULL);
        err = (i < j) ? (j - i) : (i - j);
        return err / 4294967296.0;
    }
    
    int main (void) 
    {
        float arg, res, reff;
        uint32_t argi, resi, refi;
        int64_t diff, diffsum;
        double ref, ulp, maxulp;
    
        printf ("Testing sinf ...  ");
        diffsum = 0;
        maxulp = 0;
        argi = 0;
        do {
            arg = uint32_as_float (argi);
            res = my_sinf (arg);
            ref = sin ((double)arg);
            reff = (float)ref;
            resi = float_as_uint32 (res);
            refi = float_as_uint32 (reff);
            ulp = floatUlpErr (res, ref);
            if (ulp > maxulp) {
                maxulp = ulp;
            }
            diff = (resi > refi) ? (resi - refi) : (refi - resi);
            if (diff > MAX_DIFF) {
                printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e)  ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
                return EXIT_FAILURE;
            }
            diffsum = diffsum + diff;
            argi++;
        } while (argi);
        printf ("PASSED. max ulp err = %.6f  diffsum = %lld\n", maxulp, diffsum);
    
        printf ("Testing cosf ...  ");
        diffsum = 0;
        maxulp = 0;
        argi = 0;
        do {
            arg = uint32_as_float (argi);
            res = my_cosf (arg);
            ref = cos ((double)arg);
            reff = (float)ref;
            resi = float_as_uint32 (res);
            refi = float_as_uint32 (reff);
            ulp = floatUlpErr (res, ref);
            if (ulp > maxulp) {
                maxulp = ulp;
            }
            diff = (resi > refi) ? (resi - refi) : (refi - resi);
            if (diff > MAX_DIFF) {
                printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e)  ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
                return EXIT_FAILURE;
            }
            diffsum = diffsum + diff;
            argi++;
        } while (argi);
        diffsum = diffsum + diff;
        printf ("PASSED. max ulp err = %.6f  diffsum = %lld\n", maxulp, diffsum);
        return EXIT_SUCCESS;
    }