c# floating-point ieee-754 fixed-point deterministic

How to convert an IEEE 754 float to fixed-point in a deterministic way?


I need to convert a 32-bit IEEE 754 float to a signed Q19.12 fixed-point format. The conversion must be fully deterministic, so the usual (int)(f * (1 << FRACTION_SHIFT)) is not suitable, since it relies on floating-point math that may behave differently across platforms. Are there any "bit fiddling" or similar deterministic conversion methods?

Edit: Here, deterministic means that the same floating-point input produces exactly the same conversion result on every platform.


Solution

  • While @StephenCanon's answer may well be right that this particular case is fully deterministic, I decided to stay on the safe side and do the conversion manually. This is the code I ended up with (thanks to @CodesInChaos for pointers on how to do this):

    public static Fixed FromFloatSafe(float f) {
        // Extract float bits
        uint fb = BitConverter.ToUInt32(BitConverter.GetBytes(f), 0);
        uint sign = (uint)((int)fb >> 31);
        uint exponent = (fb >> 23) & 0xFF;
        uint mantissa = (fb & 0x007FFFFF);
    
        // Reject Infinity and NaN (both signaling and quiet)
        if (exponent == 255) {
            throw new ArgumentException("Cannot convert Infinity or NaN to fixed point.", "f");
        }
        // Add the mantissa's implicit leading 1 (normal numbers only)
        if (exponent != 0) {
            mantissa |= 0x800000;
        }
    
        // Mantissa with adjusted sign
        int raw = (int)((mantissa ^ sign) - sign);
        // Shift that moves the radix point from the float's position (23 mantissa
        // bits, exponent bias 127) to the fixed-point position
        int shift = (int)exponent - 127 - 23 + FRACTION_SHIFT;
    
        // Do the shifting and check for overflows
        if (shift > 30) {
            throw new OverflowException();
        } else if (shift > 0) {
            long ul = (long)raw << shift;
            if (ul > int.MaxValue) {
                throw new OverflowException();
            }
            if (ul < int.MinValue) {
                throw new OverflowException();
            }
            raw = (int)ul;
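        } else {
            // C# masks an int shift count to its low 5 bits, so clamp it:
            // tiny inputs must shift out completely instead of wrapping around.
            // The arithmetic shift rounds toward negative infinity.
            raw >>= Math.Min(-shift, 31);
        }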
    
        return Fixed.FromRaw(raw);
    }
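
To make the bit manipulation easier to follow, here is a minimal, self-contained sketch that traces the same steps for a few inputs. It is an illustration, not the answer's code: the class FloatBitsTrace and the method RawFromFloat are hypothetical names, FRACTION_SHIFT is assumed to be 12 (Q19.12), and the method returns the raw integer directly instead of a Fixed value, since the Fixed type is not shown in the post.

```csharp
using System;

static class FloatBitsTrace
{
    // Assumption: 12 fractional bits, as in the Q19.12 format from the question.
    const int FRACTION_SHIFT = 12;

    // Hypothetical helper: returns the raw Q19.12 integer and prints each step.
    public static int RawFromFloat(float f)
    {
        uint fb = BitConverter.ToUInt32(BitConverter.GetBytes(f), 0);
        uint sign = fb >> 31;                     // 1 for negative inputs
        int exponent = (int)((fb >> 23) & 0xFF);  // biased exponent
        int mantissa = (int)(fb & 0x007FFFFF);    // 23 stored fraction bits

        if (exponent == 255)
            throw new ArgumentException("NaN or infinity", nameof(f));
        if (exponent != 0)
            mantissa |= 0x800000;                 // implicit leading 1

        int signedMantissa = sign != 0 ? -mantissa : mantissa;
        // value = mantissa * 2^(exponent - 127 - 23); raw = value * 2^FRACTION_SHIFT
        int shift = exponent - 127 - 23 + FRACTION_SHIFT;
        Console.WriteLine(
            $"bits=0x{fb:X8} sign={sign} exponent={exponent} " +
            $"mantissa=0x{mantissa:X6} shift={shift}");

        if (shift > 0)
        {
            // A 24-bit mantissa shifted by at most 31 still fits in a long.
            long wide = (long)signedMantissa << Math.Min(shift, 31);
            if (wide > int.MaxValue || wide < int.MinValue)
                throw new OverflowException();
            return (int)wide;
        }
        // Clamp the count so very small inputs shift out completely.
        return signedMantissa >> Math.Min(-shift, 31);
    }

    static void Main()
    {
        Console.WriteLine(RawFromFloat(1.5f));    // 6144  (1.5 * 4096)
        Console.WriteLine(RawFromFloat(-0.25f));  // -1024 (-0.25 * 4096)
        Console.WriteLine(RawFromFloat(1e-20f));  // 0
    }
}
```

For 1.5f the bit pattern is 0x3FC00000: the biased exponent is 127, the mantissa with its leading 1 is 0xC00000, and the shift is 127 - 127 - 23 + 12 = -11, so 0xC00000 >> 11 gives 6144. Only integer operations are involved after the bits are extracted, which is what makes the result reproducible across platforms.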