Tags: c#, mono, sse, simd, ieee-754

Floating point range reduction


I'm implementing some 32-bit float trigonometry in C# using Mono, hopefully utilizing Mono.Simd. The only piece I'm still missing is solid range reduction. I'm rather stuck now, because Mono's SIMD extensions apparently do not include conversions between floats and integers, meaning I have no access to rounding/truncation, which would be the usual method. I can, however, convert bitwise between ints and floats.
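
For reference, by "convert bitwise" I mean reinterpreting the raw bits rather than converting the value; in scalar C# that looks roughly like the sketch below (the helper names are just for illustration, not an existing API):

static unsafe int FloatToBits(float f)
{
    // Relabel the 32 bits of the float as an int; no numeric conversion happens.
    return *(int*)&f;
}

static unsafe float BitsToFloat(int i)
{
    // The reverse: treat the 32 bits of the int as an IEEE-754 single.
    return *(float*)&i;
}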

Can something like this be done? I can scale the domain up and down if needed, but ideally the range reduction should result in a domain of [0, 2 pi] or [-pi, pi]. I have a hunch that some IEEE magic with the exponent might be possible when the domain is a power of 2, but I'm really not sure how.

Edit: Okay, I've tried messing around with this C code, and it feels like I'm on the verge of something (it doesn't work, but the fractional part is always correct, at least in decimal/base 10...). The core principle seems to be taking the exponent difference between the domain and the input, and composing a new float with a shifted mantissa and an adjusted exponent. But it won't work for negatives, and I have no idea how to handle non-powers of 2 (or anything fractional; in fact, anything other than 2 doesn't work!).

#include <math.h>
#include <stdio.h>

// here's another more correct attempt:
float fmodulus(float val, int domain)
{
    const int mantissaMask = 0x7FFFFF;
    const int exponentMask = 0x7F800000;

    int ival = *(int*)&val;

    int mantissa = ival & mantissaMask;
    int rawExponent = ival & exponentMask;
    int exponent = (rawExponent >> 23) - (129 - domain);
    // powers over one:
    int p = exponent;

    mantissa <<= p;
    rawExponent = exponent >> p;
    rawExponent += 127;
    rawExponent <<= 23;

    int newVal = rawExponent & exponentMask;
    newVal |= mantissa & mantissaMask;

    float ret = *(float*)&newVal;
    
    return ret;
}

float range_reduce(float value, int range)
{
    const int mantissaMask = 0x7FFFFF;
    const int exponentMask = 0x7F800000;

    int ival = *(int*)&value;
    // grab exponent:
    unsigned exponent = (ival & exponentMask) >> 23;
    // grab mantissa:
    unsigned mantissa = ival & mantissaMask;

    // remove bias, and see how much the exponent is over range/domain
    unsigned char erange = (unsigned char)(exponent - (125 + range));
    // check if sign bit is set - that is, the exponent is under our range
    if (erange & 0x80)
    {
        // don't do anything then.
        erange = 0;
    }

    // shift mantissa (and chop off bits) by the reduced amount
    int inewVal = (mantissa << (erange)) & mantissaMask;
    // add exponent, and subtract the amount we reduced the argument with
    inewVal |= ((exponent - erange) << 23) & exponentMask;

    // reinterpret
    float newValue = *(float*)&inewVal;
    return newValue;
    //return newValue - ((erange) & 0x1 ? 1.0f : 0.0f);
}

int main()
{
    float val = 2.687f;
    int ival = *(int*)&val;
    float correct = fmod(val, 2);
    float own = range_reduce(val, 2);

    getc(stdin);
}

Edit 2:

Okay, I'm really trying to understand this in terms of the IEEE binary format. If we write the modulus operation like this:

output = input % 2

[2^exponent] + [mantissa_bit_n * 2^(exponent - n)]

3.5     = [2] + [1 + 0.5]                   ->[1] + [0.5]       = 1.5
4.5     = [4] + [0 + 0 + 0.5]               ->[0.5] + [0]       = 0.5
5.5     = [4] + [0 + 1 + 0.5]               ->[1] + [0.5]       = 1.5
2.5     = [2] + [0 + 0.5]                   ->[0.5] + [0]       = 0.5
2.25    = [2] + [0 + 0 + 0.25]              ->[0.25]            = 0.25
2.375   = [2] + [0 + 0 + 0.25 + 0.125]      ->[0.25] + [0.125]  = 0.375
13.5    = [8] + [4 + 0 + 1 + 0.5]           ->[1] + [0.5]       = 1.5
56.5    = [32] + [16 + 8 + 0 + 0 + 0 + 0.5] ->[0.5]             = 0.5

In every case the output is a new number: the original exponent is gone, and the mantissa has been shifted into the exponent by an amount that depends on the original exponent and on the first non-zero mantissa bit remaining once the leading exponent-many mantissa bits are discarded. But I'm not really sure whether this is the correct approach; it just works out nicely on paper.
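
Written out as scalar (non-SIMD, unsafe) C#, that observation becomes something like the sketch below, for the domain = 2, val >= 0 case. The awkward step is re-normalizing after throwing the high bits away; letting an ordinary floating-point subtraction do that turns out to be enough (the helper is mine, just to illustrate the idea):

static unsafe float Mod2(float val)
{
    // Sketch only: assumes val is finite and non-negative.
    if (val < 2f)
        return val;                              // already in [0, 2)

    int bits = *(int*)&val;
    int exponent = ((bits >> 23) & 0xFF) - 127;  // unbiased exponent; >= 1 here
    if (exponent >= 24)
        return 0f;                               // ulp >= 2, so val is an even integer

    int lowBits = 24 - exponent;                 // mantissa bits with weight < 2
    int evenBits = (bits >> lowBits) << lowBits; // zero them: largest multiple of 2 <= val
    float evenPart = *(float*)&evenBits;

    return val - evenPart;                       // exact; the hardware re-normalizes
}

This only covers val >= 0 and a domain of exactly 2, though; negatives and non-power-of-two domains are still open.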

Edit 3: I'm stuck on Mono version 2.0.50727.1433


Solution

  • You can reduce the problem to taking a float mod 1. To simplify that, you can compute the floor of the float using bit operations, then use a floating point subtraction. The following is (unsafe) C# code for these operations:

    // domain is assumed to be positive
    // returns value in [0,domain)
    public float fmodulus(float val, float domain)
    {
        if (val < 0)
        {
            float negative = fmodulus(-val, domain);
            if (domain - negative == domain)
                return 0;
            else
                return domain-negative;
        }
    
        if (val < domain)
            return val; // this avoids losing accuracy
    
        return fmodOne(val / domain) * domain;
    }
    
    // assumes val >= 1, so val is positive and the exponent is at least 0 
    unsafe public float fmodOne(float val)
    {
        int iVal = *(int*)&val;
        int uncenteredExponent = iVal >> 23;
        int exponent = uncenteredExponent - 127; // 127 corresponds to 2^0 times the mantissa
        if (exponent >= 23) 
            return 0; // not enough precision to distinguish val from an integer
    
        int unneededBits = 23 - exponent; // between 1 and 23
        int iFloorVal = (iVal >> unneededBits) << unneededBits; // equivalent to using a mask to zero the bottom bits of the mantissa
        float floorVal = *(float*)&iFloorVal; // convert the bit pattern back to a float
    
        return val-floorVal;
    }
    

    For example, fmodulus(100.1f, 1) is 0.09999847. The bit pattern of 100.1f is

    0 10000101 10010000011001100110011

    The bit pattern of floorVal (100f) is

    0 10000101 10010000000000000000000

    A floating point subtraction gives something close to 0.1f:

    0 01111011 10011001100110000000000

    Actually, I was surprised that the last 8 bits were zeroed out. I thought only the last 6 bits of 0.1f were supposed to be replaced with 0. Perhaps one can do better than relying on the floating point subtraction.
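
    To tie this back to the trig use case: assuming the fmodulus above, reducing an angle into [-pi, pi) could look roughly like the sketch below (the constants and the wrapper are illustrative, not part of the code above):

    const float Pi    = 3.14159265f;
    const float TwoPi = 6.28318531f;

    // Reduce an arbitrary angle to [-Pi, Pi) for a sin/cos kernel.
    public float ReduceAngle(float angle)
    {
        float r = fmodulus(angle, TwoPi); // r is in [0, TwoPi)
        if (r >= Pi)
            r -= TwoPi;                   // fold the upper half down to [-Pi, 0)
        return r;
    }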