
Algorithm for converting between IEEE floating-point formats


Assume I have a floating-point number (fp) in a given format FA (i.e. with its own exponent size and mantissa size), more specifically something like

FA fa;

and suppose I want to convert it to a format FB with an operation FA_2_FB, which gives a floating-point number fb, i.e. something like

FB fb = FA_2_FB(fa);

To the best of your knowledge, does the IEEE standard provide a general way to perform this conversion? (It could be a narrowing, a widening, or simply a format change with the same number of bits.)

Is my question clear? If the standard doesn't provide anything, I will specify which cases I'm considering.


Solution

  • I am copying the terminology from the Wikipedia article IEEE floating point

    I think the best approach to this is to split it into four problems:

    1. Identify NaN and infinite input, and directly generate the corresponding bit pattern in the destination format.
    2. Given a finite numeric input, extract the sign, significand, and exponent.
    3. Check for overflow and for a subnormal result in the new format. On overflow, generate the appropriate infinity. If the result is subnormal, calculate the number of bits to keep in the significand.
    4. Pack into the new format. That may require rounding if the new significand has fewer bits than the old one.
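As an illustration of steps 1 and 2 only, here is a minimal sketch in C. It assumes a binary format laid out as sign, exponent field, trailing significand field, with at most 64 bits in total and an exponent field of 2 to 30 bits. The names `fp_format`, `fp_parts`, `fp_kind`, and `fp_unpack` are hypothetical and not taken from any standard or library.

```c
#include <stdint.h>

/* Hypothetical descriptor of a binary format (assumption, not a standard API). */
typedef struct {
    unsigned exp_bits;   /* width of the exponent field */
    unsigned man_bits;   /* width of the trailing significand field */
} fp_format;

typedef enum { CLS_ZERO, CLS_SUBNORMAL, CLS_NORMAL, CLS_INF, CLS_NAN } fp_kind;

/* For finite inputs: value = (-1)^sign * significand * 2^(exponent - man_bits). */
typedef struct {
    int      sign;         /* 0 or 1 */
    int      exponent;     /* unbiased exponent */
    uint64_t significand;  /* integer significand; implicit bit included for normals */
    fp_kind  kind;
} fp_parts;

fp_parts fp_unpack(uint64_t bits, fp_format f)
{
    const uint64_t man_mask = ((uint64_t)1 << f.man_bits) - 1;
    const uint64_t exp_mask = ((uint64_t)1 << f.exp_bits) - 1;
    const int bias = (1 << (f.exp_bits - 1)) - 1;

    const uint64_t man = bits & man_mask;
    const uint64_t exp = (bits >> f.man_bits) & exp_mask;

    fp_parts p;
    p.sign = (int)((bits >> (f.man_bits + f.exp_bits)) & 1u);

    if (exp == exp_mask) {            /* step 1: all-ones exponent is Inf or NaN */
        p.kind = man ? CLS_NAN : CLS_INF;
        p.exponent = 0;
        p.significand = man;          /* NaN payload, kept as-is */
    } else if (exp == 0) {            /* zero or subnormal: no implicit leading 1 */
        p.kind = man ? CLS_SUBNORMAL : CLS_ZERO;
        p.exponent = 1 - bias;        /* subnormals use the minimum exponent */
        p.significand = man;
    } else {                          /* step 2: ordinary normal number */
        p.kind = CLS_NORMAL;
        p.exponent = (int)exp - bias;
        p.significand = man | ((uint64_t)1 << f.man_bits);
    }
    return p;
}
```

For example, with `fp_format b32 = {8, 23};`, calling `fp_unpack(0x3F800000u, b32)` (the binary32 pattern for 1.0) yields a normal number with exponent 0 and significand `0x800000`. Steps 3 and 4 would then rebias the exponent for the destination format and shift the significand to the destination width.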

    You will need to select one of the standard rounding modes. The simplest is rounding towards zero, which is simple truncation. However, I recommend round to nearest with the midpoint rounded to even. For that, you need to look both at the value of the first dropped bit, and also whether there are any non-zero bits beyond it.
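The round-to-nearest-even rule described above can be sketched as a right shift that inspects the first dropped bit and the bits beyond it. This is an illustrative helper, not code from the standard; the name `shift_right_rne` is hypothetical, and the significand is assumed to be held as an unsigned 64-bit integer.

```c
#include <stdint.h>

/* Drop the low `shift` bits of `sig`, rounding to nearest, ties to even. */
uint64_t shift_right_rne(uint64_t sig, unsigned shift)
{
    if (shift == 0) return sig;   /* nothing dropped, nothing to round */
    if (shift > 64) return 0;     /* value is below half a unit: rounds to 0 */

    uint64_t kept  = (shift < 64) ? (sig >> shift) : 0;
    uint64_t guard = (sig >> (shift - 1)) & 1u;         /* first dropped bit */
    uint64_t below = ((uint64_t)1 << (shift - 1)) - 1;  /* mask of bits beyond it */
    int sticky = (sig & below) != 0;                    /* any non-zero bit there? */

    /* Above the midpoint: round up. Exactly at the midpoint: round up only
       if the kept value is odd, so the result ends up even. */
    if (guard && (sticky || (kept & 1u)))
        kept += 1;
    return kept;
}
```

Note that rounding up can carry out of the significand: `shift_right_rne(0xFFFFFF, 13)` gives `0x800`, one bit wider than the 11-bit result a binary32 to binary16 narrowing expects. In that case the caller needs to renormalize by incrementing the exponent, and then recheck for overflow.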