algorithm floating-point precision calculation

How can I implement a fmod function with decent precision without casting them into double/long double?

I'm trying to implement a fmodf function in assembly. I used the formula as follows.

fmodf(float x,float y) = x - trunc(x / y) * y

The problem is, when x>>y the precision of the function is getting worse. In fact,when abs(x/y)>2^23, trunc will not get rid of the decimal part because it got no decimal part.(which I know that the mantissa of IEEE-754 floating-point standard is 23 bits)

I have looked into the implementation of fmod in cmath, which just used __built_in_remainderl which is basically used long double instead of float, which I cannot use in my assembly(it does not support double or long double)

Solution

fmod() can be implemented in a straightforward way via simple binary longhand division if performance is not critical. Longhand division is the division process we all learned in grade school, generating one quotient digit per step. It works in the same way when using base two, where it generates one bit of quotient per step. As with any longhand division, it delivers the exact remainder as required by the fmod() specification in ISO-C and ISO-C++.

The binary division process can be performed in integer arithmetic or floating-point arithmetic. Asker did not indicate details of their platform, but the mention of support for single-precision only computation suggests it could be a processor with low-latency hardware support for single-precision IEEE-754 arithmetic using the binary32 format, such as an ARM Cortex-M4. In the ISO C99 code below I am therefore demonstrating floating-point based computation.

The code in my_fmodf() starts out by handling a few special cases involving NaNs, infinities, and zeros. If the magnitude of the divisor is greater than the magnitude of the dividend, the result is the dividend. Otherwise we have to prepare for the division process by left-aligning the divisor, i.e. bringing the divisor to the same order of magnitude as the dividend. In each division step the next quotient bit is based on a comparison of the partial remainder with the divisor. If it is greater than or equal to the divisor, the quotient bit is one, and we need to subtract the divisor from the partial remainder. Since the quotient itself is never needed for fmod(), it is not recorded.

The code assumes that float maps to the IEEE-754 binary32 format and that the storage conventions used for 32-bit integer data and 32-bit floating-point data are identical. It was developed and "smoke"-tested on a platform that supports subnormal operands, so it may or may not work properly in a non-IEEE754 flush-to-zero (FTZ) regime. I built with the Intel C/C++ compiler using /fp:strict for strictest adherence to IEEE-754 floating point semantics.

The C code below can be translated to assembly language pretty much one to one. frexpf() and ldexpf() are used to manipulate the exponent field of floating-point numbers and for the limited functionality used here can be replaced with a few assembly instructions each on common processor architectures. Similar isnan() and isinf() map to a simple inspection of the bit pattern underlying floating-point operands. Similarly, fabsf() and copysignf(), if not directly supported by hardware instructions, are simple manipulations of the sign bit of floating-point operands with integer logic instructions.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

/* returns the floating-point remainder of a/b (rounded towards zero) */
float my_fmodf (float a, float b)
{
    const float NAN_INDEFINITE = uint32_as_float (0xffc00000);
    float r;
    if (isnan (a) || isnan (b)) {
        r = a + b;
    } else if (isinf (a) || (b == 0.0f)) {
        r = NAN_INDEFINITE;
    } else {
        float fa, fb, dividend, divisor;
        int expo_a, expo_b;
        fa = fabsf (a);
        fb = fabsf (b);
        if (fa >= fb) {
            dividend = fa;
            /* normalize divisor */
            (void)frexpf (fa, &expo_a);
            (void)frexpf (fb, &expo_b);
            divisor = ldexpf (fb, expo_a - expo_b);
            if (divisor <= 0.5f * dividend) {
                divisor += divisor;
            }
            /* compute quotient one bit at a time */
            while (divisor >= fb) {
                if (dividend >= divisor) {
                    dividend -= divisor;
                }
                divisor *= 0.5f;
            }
            /* dividend now represents remainder */
            r = copysignf (dividend, a);
        } else {
            r = a;
        }
    }
    return r;
}

// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC  ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
              kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{
    double a, b, res, ref;
    uint64_t i = 0;
    do {
        a = uint32_as_float (KISS);
        b = uint32_as_float (KISS);
        res = my_fmodf (a, b);
        ref = fmodf (a, b);

        if (float_as_uint32 (res) != float_as_uint32 (ref)) {
            printf ("error: a=% 15.8e b=% 15.8e res=% 15.8e ref=% 15.8e\n", 
                    a, b, res, ref);
            return EXIT_FAILURE;
        }
        i++;
        if (!(i & 0xfffff)) printf ("\r%llu", i);
    } while (i);
    return EXIT_SUCCESS;
}