assembly, floating-point, nasm, multiplication, denormal-numbers

Denormalize the product of 2 floating point numbers or not?


I'm trying to multiply 2 floating point numbers without using the floating point instructions. Everything was going well until I came across denormalized numbers. How do I know whether I should normalize or denormalize the product? This uncertainty makes rounding the product hard. My intuition tells me that the product should be denormalized if both factors are denormalized numbers.


Solution

  • Subnormal numbers are very close to zero. For a subnormal x, x^2 has about twice the (already very negative) unbiased exponent, and that's way too small for even a subnormal to represent. This holds even if x is the largest subnormal, i.e. nextafter(FLT_MIN, -INFINITY), and things are similar for the product of any two subnormal numbers.

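    The product of two subnormal numbers always fully underflows to +0.0 or -0.0 in the default round-to-nearest mode. (Under a directed rounding mode it can instead round to the smallest-magnitude subnormal.)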
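As a quick check of the claim above, here is a minimal C sketch that squares the largest subnormal float. The helper names (`f32_from_bits`, `mul_f32`, `largest_subnormal_squared`) are made up for illustration, and the sketch assumes the default round-to-nearest mode and a `float` that is IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helper: reinterpret a 32-bit pattern as a binary32 float. */
float f32_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Multiply through a volatile so the product is rounded to float even on
   targets that evaluate expressions in a wider format (e.g. x87). */
float mul_f32(float a, float b)
{
    volatile float p = a * b;
    return p;
}

/* Square of the largest subnormal float (bit pattern 0x007FFFFF, just
   below FLT_MIN = 2^-126). The exact square is about 2^-252, far below
   half of the smallest subnormal (2^-150), so it rounds to zero. */
float largest_subnormal_squared(void)
{
    float x = f32_from_bits(0x007FFFFFu);
    return mul_f32(x, x);
}
```

Since the largest subnormal already squares to zero, any smaller pair of subnormals does too; only the sign of the zero depends on the operands.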

    The result of any operation should always be normalized if possible. The only time it's not possible is when the exponent would be too small: then subnormal (aka denormal) numbers give you gradual underflow by leaving leading bits of the mantissa as zero, with the exponent fixed at its minimum value. https://en.wikipedia.org/wiki/Single-precision_floating-point_format explains subnormal numbers in general pretty well.
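The rule above ("normalize if possible, otherwise go subnormal") can be sketched in integer-only C for binary32. This is one possible packing step under stated assumptions, not a complete multiplier: `pack_binary32` is a hypothetical name, the significand is assumed to be already normalized to 24 bits with its leading 1 at bit 23, rounding is round-to-nearest-even, and NaN handling is left out.

```c
#include <stdint.h>

/* Hypothetical helper: encode sign * sig * 2^(exp - 23) as binary32 bits.
   Assumes 2^23 <= sig < 2^24 and that sig is exact (no discarded bits). */
uint32_t pack_binary32(uint32_t sign, int exp, uint32_t sig)
{
    uint32_t s = sign << 31;

    if (exp > 127)                  /* too large: overflow to infinity */
        return s | 0x7F800000u;

    if (exp >= -126)                /* fits: stay normalized, drop the implicit 1 */
        return s | ((uint32_t)(exp + 127) << 23) | (sig & 0x007FFFFFu);

    /* Exponent too small: keep the exponent field at 0 and shift the
       significand right instead, which creates the leading zero bits. */
    int shift = -126 - exp;         /* at least 1 here */
    if (shift >= 25)                /* below half the smallest subnormal */
        return s;                   /* full underflow to signed zero */

    uint32_t kept = sig >> shift;
    uint32_t rem  = sig & ((1u << shift) - 1u);
    uint32_t half = 1u << (shift - 1);

    /* Round to nearest, ties to even. */
    if (rem > half || (rem == half && (kept & 1u)))
        kept++;

    /* If rounding carried up to 2^23, the bit lands in the exponent field
       and the result is FLT_MIN, which is the correct encoding. */
    return s | kept;
}
```

In a real software multiply, the right shift and rounding should be applied once to the full 48-bit product (keeping a sticky bit for the discarded low bits), because rounding to 24 bits first and then again here can round twice.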

    This is a general rule for floating point: IEEE 754 formats like binary32 and binary64 leave no choice in how to represent any given finite value. A non-zero exponent encoding implies a leading 1 in the mantissa, so a float or double can't be denormalized unless it is subnormal. The x87 80-bit extended-precision format stores all its mantissa bits explicitly, so it's possible to encode a number with a non-zero exponent but leading zeros in the mantissa. However, hardware may even consider that invalid, and you should definitely never produce it, because it means throwing away more mantissa bits than necessary (if this was a multiply).
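To make the implicit leading 1 concrete, here is a small C sketch of how a software implementation might unpack a binary32 bit pattern before multiplying. The type `f32_fields` and the function `unpack_binary32` are illustrative names, and infinities and NaNs (exponent field 255) are deliberately not handled.

```c
#include <stdint.h>

/* Illustrative decoded form: value = (-1)^sign * sig * 2^(exp - 23). */
typedef struct {
    uint32_t sign;  /* 0 or 1 */
    int      exp;   /* unbiased exponent */
    uint32_t sig;   /* 24-bit significand, including any implicit 1 */
} f32_fields;

/* Unpack a finite binary32 bit pattern (exponent field 0..254). */
f32_fields unpack_binary32(uint32_t bits)
{
    f32_fields f;
    uint32_t e = (bits >> 23) & 0xFFu;
    uint32_t m = bits & 0x007FFFFFu;

    f.sign = bits >> 31;
    if (e == 0) {
        /* Zero or subnormal: no implicit 1, exponent pinned at -126. */
        f.exp = -126;
        f.sig = m;
    } else {
        /* Normal: the encoding itself implies the leading 1. */
        f.exp = (int)e - 127;
        f.sig = m | 0x00800000u;
    }
    return f;
}
```

Because the leading 1 is implied by a non-zero exponent field, there is no bit pattern for a normal-range value with leading zeros in its significand; only the exponent-field-0 case can carry them.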

    Addition or subtraction can also produce subnormal numbers, if the operands' signs differ or match, respectively. E.g. nextafter(FLT_MIN, +INFINITY) - FLT_MIN cancels all but the lowest mantissa bit (an example of "catastrophic cancellation"), leaving a number too small to be represented as a normalized float.
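The subtraction example can be reproduced with a short C sketch. The helper names (`float_of_bits`, `bits_of_float`, `cancellation_result_bits`) are invented for illustration; the sketch builds the operands from bit patterns instead of calling `nextafterf`, and it assumes `float` is binary32 and that flush-to-zero is not enabled.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helpers to move between floats and their bit patterns. */
float float_of_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

uint32_t bits_of_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Computes (FLT_MIN + 1 ulp) - FLT_MIN and returns the raw result bits. */
uint32_t cancellation_result_bits(void)
{
    float min_normal = float_of_bits(0x00800000u); /* FLT_MIN = 2^-126 */
    float next_up    = float_of_bits(0x00800001u); /* next float above FLT_MIN */
    volatile float diff = next_up - min_normal;    /* exact: 2^-149 */
    return bits_of_float(diff);
}
```

The result has exponent field 0 and only the lowest mantissa bit set (bit pattern 0x00000001), i.e. the smallest positive subnormal, 2^-149.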