
Rules for Explicit int32 -> float32 Casting


I have a homework assignment to emulate floating point casts, e.g.:

int y = /* ... */;
float x = (float)(y);

. . . but obviously without using casting. That's fine, and I wouldn't have a problem, except I can't find any specific, concrete definition of how exactly such casts are supposed to operate.

I have written an implementation that works fairly well, but occasionally it doesn't quite match the real cast (for example, it might put a value of three in the exponent and fill the mantissa with ones, but the "ground truth" will have a value of four in the exponent and fill the mantissa with zeroes). The fact that the two are equivalent (sorta, by infinite series) is frustrating because the bit pattern is still "wrong".
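For reference, a helper along these lines will dump the three fields of a float so the emulated and "ground truth" bit patterns can be compared side by side (just a debugging sketch; dump_float_bits is an illustrative name, not part of the assignment):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Dump the sign/exponent/mantissa fields of a float so an emulated
   conversion can be compared against the compiler's cast. */
void dump_float_bits(const char *label, float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy out the raw IEEE 754 bit pattern */
    printf("%s: sign=%u exp=%u mantissa=0x%06x (raw 0x%08x)\n",
           label,
           (unsigned)(bits >> 31),
           (unsigned)((bits >> 23) & 0xffu),
           (unsigned)(bits & 0x7fffffu),
           (unsigned)bits);
}

int main(void)
{
    int y = 0x02000003;                        /* more than 24 significant bits, so it must round */
    dump_float_bits("ground truth", (float)y);
    return 0;
}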

Sure, I get vague things, like "round toward zero", from scattered websites, but honestly my searches keep getting clogged with C newbie questions (e.g., "What's a cast?", "When do I use it?"). So, I can't find a general rule for explicitly defining the exponent and the mantissa.

Help?


Solution

  • Since this is homework, I'll just post some notes about what I think is the tricky part: rounding when the magnitude of the integer is larger than what the float's precision will hold. It sounds like you already have a solution for the basics of obtaining the exponent and mantissa.

    I'll assume that your float representation is IEEE 754, and that rounding is performed the same way that MSVC and MinGW do: using a "banker's rounding" scheme (I'm honestly not sure if that particular rounding scheme is required by the standard; it's what I tested against, though). The remaining discussion assumes the int to be converted is greater than 0. Negative numbers can be handled by dealing with their absolute value and setting the sign bit at the end. Of course, 0 needs to be handled specially in any case (because there's no msb to find).
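    A rough sketch of that framing, assuming the exponent/mantissa/rounding work lives in a hypothetical helper (magnitude_to_float_bits is my name, nothing standard), might look like:

        #include <stdint.h>

        /* Hypothetical helper: converts a positive magnitude (1..0x80000000)
           to IEEE 754 single-precision bits; this is where the exponent,
           mantissa, and rounding work from the rest of the answer would go. */
        uint32_t magnitude_to_float_bits(uint32_t v);

        /* Sketch of the sign/zero framing around the core conversion. */
        uint32_t int_to_float_bits(int32_t y)
        {
            if (y == 0)
                return 0;                                 /* 0.0f: there is no msb to find */

            uint32_t sign = (y < 0) ? 0x80000000u : 0u;
            uint32_t mag  = (y < 0) ? -(uint32_t)y : (uint32_t)y;  /* unsigned negation handles INT32_MIN */

            return sign | magnitude_to_float_bits(mag);   /* set the sign bit at the end */
        }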

    Since there are 24 bits of precision in the mantissa (including the implied 1 for the msb), ints up to 16777215 (or 0x00ffffff) can be represented exactly. There's nothing particularly special to do other than the bit shifting to get things in the right place and calculating the correct exponent depending on the shifts.
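    A minimal sketch of that exact case, assuming an already-positive value v with 1 <= v <= 0x00ffffff and the usual single-precision layout (bias 127, 23 stored mantissa bits); the function name is just illustrative:

        #include <stdint.h>

        /* Sketch: build the float bit pattern for a value small enough that
           no rounding is needed (at most 24 significant bits). */
        uint32_t exact_magnitude_to_float_bits(uint32_t v)   /* requires 1 <= v <= 0x00ffffff */
        {
            int msb = 23;
            while (!(v & (1u << msb)))                    /* locate the highest set bit */
                msb--;

            uint32_t exponent = (uint32_t)(msb + 127);            /* biased exponent */
            uint32_t mantissa = (v << (23 - msb)) & 0x7fffffu;    /* shift into place, drop the implied 1 */

            return (exponent << 23) | mantissa;
        }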

    However, if the int value needs more than 24 bits of precision, you'll need to round. I performed the rounding using these steps (a small sketch follows the list):

    1. If the msb of the dropped bits is 0, nothing more needs to be done. The mantissa and exponent can be left alone.
    2. If the msb of the dropped bits is 1, and the remaining dropped bits have one or more bits set, the mantissa needs to be incremented. If the mantissa overflows (beyond 24 bits, assuming you haven't already dropped the implied msb), then the mantissa needs to be shifted right, and the exponent incremented.
    3. If the msb of the dropped bits is 1, and the remaining dropped bits are all 0, then the mantissa is incremented only if the lsb of the mantissa is 1. Handle overflow of the mantissa similarly to case 2.
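    A minimal sketch of those three cases, assuming the mantissa still carries the implied msb (24 significant bits), dropped holds the bits that were shifted out, and drop_count is how many there were; overflow of the result is left to the caller (names are mine):

        #include <stdint.h>

        /* Sketch: round-half-to-even decision for the low bits shifted out of
           a value that needed more than 24 bits.  Returns the possibly
           incremented 24-bit mantissa; the caller still has to handle an
           overflow to 25 bits. */
        uint32_t round_mantissa(uint32_t mantissa, uint32_t dropped, int drop_count)
        {
            uint32_t half = 1u << (drop_count - 1);   /* the msb of the dropped bits */

            if ((dropped & half) == 0)
                return mantissa;                      /* case 1: below halfway, truncate */

            if ((dropped & (half - 1)) != 0)
                return mantissa + 1;                  /* case 2: above halfway, round up */

            return mantissa + (mantissa & 1u);        /* case 3: exactly halfway, round to even */
        }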

    Since the mantissa increment will overflow only when it's all 1's, if you're not carrying around the mantissa's msb (i.e., if you've already dropped it since it'll be dropped in the ultimate float representation), then the cases where the mantissa increment overflows can be fixed up simply by setting the mantissa to zero and incrementing the exponent.
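    In code, that fix-up might look like the sketch below, assuming mantissa holds only the 23 stored bits (implied msb already dropped) and exponent is the biased exponent; the names are again illustrative:

        #include <stdint.h>

        /* Sketch: apply a rounding increment when the implied msb has already
           been dropped.  An all-ones stored mantissa rolls over to zero, and
           the carry is absorbed by bumping the exponent. */
        void increment_mantissa(uint32_t *mantissa, uint32_t *exponent)
        {
            *mantissa += 1;
            if (*mantissa == (1u << 23)) {   /* increment overflowed the 23 stored bits */
                *mantissa = 0;
                *exponent += 1;
            }
        }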