floating-point, eps

Normalised and denormalised floating points


I don't understand why a denormalised number is always smaller than a normalised one. I remember that normalisation simply means that we have exactly one non-zero digit before the radix point (for example 1.110). Let's say, for example, that I have a 16-bit floating-point format: the exponent has 5 bits and the mantissa has 10 bits plus the hidden bit, so the bias is 15. If I want the minimum possible number, I have 1 - 15 = -14 as the exponent (because 0 is reserved). The normalised minimum number is then 2^(-14), but the denormalised one is calculated by multiplying the normalised minimum number by the machine epsilon, that is 2^(-14)*2^(-10). I don't understand why we multiply the minimum by the machine epsilon, and why the result is considered a denormalised number. Thank you


Solution

  • A floating-point representation of a finite number is s • f • b^e, where:

    • b is a fixed base (a base determined by the format, not varying with the represented values),
    • s is a sign (+1 or −1),
    • f is a p-digit number (called the significand or fraction) in base b (where p is a fixed precision),
    • e is an exponent satisfying e_min ≤ e ≤ e_max, where e_min and e_max are fixed bounds.

    f is often described either as a p-digit number with a decimal point or “radix point” after the first digit, such as 1.011, or as a p-digit integer with no radix point (equivalently, a radix point after all the digits), such as 1011. These descriptions are completely equivalent when e is adjusted for the difference between them, p−1 digit positions, along with matching adjustments to e_min and e_max.

    In the first case, f may be written explicitly as ∑_{i=0..p−1} f_i • b^(−i), where f_0, f_1, and so on are the digits of f.
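As a minimal sketch of the two equivalent views of f (not part of the original answer), the following Python evaluates the digit sum exactly with fractions. The helper names `significand_value` and `represented_value` are made up for illustration, and the digit weight b^(−i) assumes the radix point sits after the first digit.

```python
from fractions import Fraction

def significand_value(digits, base):
    """Value of f = f_0 . f_1 f_2 ... with the radix point after the first digit."""
    return sum(Fraction(d) * Fraction(base) ** -i for i, d in enumerate(digits))

def represented_value(sign, digits, base, exponent):
    """Value of sign * f * base**exponent, computed exactly."""
    return sign * significand_value(digits, base) * Fraction(base) ** exponent

digits = [1, 0, 1, 1]  # 1.011 in base 2, so p = 4
print(significand_value(digits, 2))  # 11/8

# Integer view: 1011 in base 2 is 11, with the exponent lowered by p - 1 = 3.
first_view = represented_value(+1, digits, 2, 0)
integer_view = 11 * Fraction(2) ** (0 - 3)
print(first_view == integer_view)  # True
```

Both views give the same value because moving the radix point by p−1 places is exactly compensated by changing the exponent by p−1.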

    If f_0 is not zero, the number is in normal form. If f_0 is zero, the form of the number is denormalized.

    If e > e_min and the first digit of f is zero, we can replace f with f • b and replace e with e−1, and these new values will represent the same number. If the new first digit is still zero, but e is still greater than e_min and not all the digits are zero, then we can repeat this until we get a first digit that is not zero. This is called normalization.

    If we cannot make the first digit non-zero because e would be less than e_min, the number is smaller in magnitude than every number that can be represented in normal form, so it is said to be subnormal. A number that is subnormal cannot be normalized, because it is too small. A number in a form that is denormalized but is not subnormal can be normalized.
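The normalization loop described above can be sketched as follows. This is only an illustration under assumptions: the function name `normalize` and the digit-list representation are hypothetical, and the significand is taken to have its radix point after the first digit.

```python
def normalize(digits, exponent, e_min):
    """Shift leading zeros out of the significand while the exponent allows it.

    Returns (digits, exponent, kind), where kind is "zero", "normal",
    or "subnormal".
    """
    digits = list(digits)
    if not any(digits):
        return digits, exponent, "zero"
    # Multiplying f by the base shifts every digit one place to the left;
    # the exponent drops by one so the represented value is unchanged.
    while digits[0] == 0 and exponent > e_min:
        digits = digits[1:] + [0]
        exponent -= 1
    kind = "normal" if digits[0] != 0 else "subnormal"
    return digits, exponent, kind

# Denormalized but not subnormal: there is room to lower the exponent.
print(normalize([0, 0, 1, 1], -10, e_min=-14))  # ([1, 1, 0, 0], -12, 'normal')

# Subnormal: the exponent reaches e_min before the first digit becomes non-zero.
print(normalize([0, 0, 1, 1], -13, e_min=-14))  # ([0, 1, 1, 0], -14, 'subnormal')
```

The loop never discards a non-zero digit, because it only shifts while the leading digit is zero.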

    In the past, denormal was used to refer to subnormal numbers. However, for clarity, we distinguish these words. A subnormal number is always smaller than a normal number, because the definition of subnormal is that it is a number too small to be represented in normal form. However, it is possible to have denormalized forms that are larger than normalized forms, simply because they have larger exponents, by enough to make up for the difference in the fraction.
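A small numeric check of that last point, as a sketch only: the helper `value` and the 4-digit binary significands are chosen purely for illustration.

```python
from fractions import Fraction

def value(digits, exponent, base=2):
    """Exact value of f * base**exponent, radix point after the first digit."""
    f = sum(Fraction(d, base ** i) for i, d in enumerate(digits))
    return f * Fraction(base) ** exponent

denormalized = value([0, 0, 1, 1], 5)  # 0.011 (binary) times 2^5
normalized = value([1, 0, 0, 0], 0)    # 1.000 (binary) times 2^0

print(denormalized, normalized, denormalized > normalized)  # 12 1 True
```

The first form has a zero leading digit, yet its larger exponent makes it represent 12, which exceeds the normalized 1.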

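    IEEE-754 binary formats do not allow denormalized forms of normal numbers in their binary encodings; the only encodings with a zero leading digit are zero and the subnormal numbers. IEEE-754 decimal formats do allow them, and non-IEEE-754 formats may also.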
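Python's `decimal` module is not an IEEE-754 interchange encoding, but it models decimal floating point in a similar spirit, so it can serve as a rough illustration (under that assumption) of one value having several representations that differ in coefficient and exponent.

```python
from decimal import Decimal

a = Decimal("1.0")   # coefficient 10,  exponent -1
b = Decimal("1.00")  # coefficient 100, exponent -2

print(a == b)                        # True: same value
print(a.as_tuple() == b.as_tuple())  # False: different representations
```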

    In the 16-bit format you describe, b is 2, p is 11, e_min is −14, and e_max is 15. The minimum normal positive number has exponent −14 and significand 1.0000000000₂, so its form is +1 • 1.0000000000₂ • 2^(−14), which represents the value 2^(−14).

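    The minimum positive number has exponent −14 and significand 0.0000000001₂, so its form is +1 • 0.0000000001₂ • 2^(−14), which represents the value 2^(−10) • 2^(−14) = 2^(−24). The 2^(−10) arises because that is where the 1 digit is relative to the start of f. With machine epsilon defined as b^(1−p), as in your question, that factor equals the machine epsilon, which is why the minimum positive number can be computed as the minimum normal number times the machine epsilon.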
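These two values can be cross-checked against a real binary16 implementation. The sketch below assumes Python 3.6 or later, where `struct` supports the half-precision format code `e`; the helper name `half_from_bits` is made up for illustration.

```python
import struct

def half_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE-754 binary16 value."""
    return struct.unpack("<e", bits.to_bytes(2, "little"))[0]

min_normal = half_from_bits(0x0400)     # exponent field 1, fraction field 0
min_subnormal = half_from_bits(0x0001)  # exponent field 0, fraction field 1

print(min_normal == 2.0 ** -14)     # True
print(min_subnormal == 2.0 ** -24)  # True
```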

    Note that all of the above is defined purely in terms of the form of the floating-point representation. The bits that encode the floating-point representation are not mentioned and are irrelevant. The facts that we encode the exponent using a bias of 15 or that we can encode the significand mostly in ten bits with one bit determined from the exponent field are irrelevant. Once we know b, p, and e_min, we can figure out the smallest normal positive number and the smallest positive number without knowing anything about the encoding method.
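To illustrate that only b, p, and e_min are needed, here is a minimal sketch. The function name `smallest_positive_values` is hypothetical, machine epsilon is taken to be b^(1−p), and the comparison at the end assumes the platform's Python `float` is IEEE-754 binary64 (p = 53, e_min = −1022), which is typical but not guaranteed.

```python
import sys
from fractions import Fraction

def smallest_positive_values(b, p, e_min):
    """Return (smallest normal, smallest positive) for the given parameters."""
    min_normal = Fraction(b) ** e_min    # significand 1.00...0 at exponent e_min
    epsilon = Fraction(b) ** (1 - p)     # weight of the last digit of f
    min_positive = epsilon * min_normal  # significand 0.00...1 at exponent e_min
    return min_normal, min_positive

# The 16-bit format from the question: b = 2, p = 11, e_min = -14.
print(smallest_positive_values(2, 11, -14))
# (Fraction(1, 16384), Fraction(1, 16777216)), that is 2^(-14) and 2^(-24)

# Assumed binary64 parameters, compared with what the platform reports.
min_normal, min_positive = smallest_positive_values(2, 53, -1022)
print(min_normal == Fraction(sys.float_info.min))  # True
print(min_positive == Fraction(sys.float_info.min * sys.float_info.epsilon))  # True
```

The last line is the same "minimum times epsilon" product as in the question, carried out in hardware floating point.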