I am reading the book "The Secret Life of Programs". In the first chapter there is an explanation of a trick invented by Digital Equipment Corporation (DEC) that doubles the accuracy: "throwing away the leftmost bit of the mantissa doubles the accuracy, since we know that it will always be 1, which makes room for one more bit."
I cannot understand this. As an example: Let's consider a 4-bit floating-point representation with 2 bits of mantissa and 2 bits of exponent. The binary number 01.11 then represents 4.0. The leftmost bit is 0 and not 1.
Can anyone explain to me, with a simple example, what is meant by "throwing away the leftmost bit" and "doubling accuracy"?
Floating-point formats differ, but, for the most part, a floating-point representation is s · b^e · f, where s is +1 or −1 to represent a sign, b is a base (usually 2 or 10), e is an exponent, and f is a p-digit numeral in base b. b and p are fixed numbers determined by the format. s, e, and f vary according to the number represented. f is called the significand. (This is the preferred term for the fraction portion of a floating-point number. “Mantissa” is a historic term for the fraction portion of a logarithm.)
Bounds on e and how f is scaled are determined by the format. Sometimes f is specified to have a radix point (the general form of a decimal point) after its first digit, so its form is d_0 . d_(−1) d_(−2) … d_(1−p), where each d_i is a base-b digit (an integer from 0 to b−1, inclusive). In this form, 0 ≤ f < b. (We do not need to be concerned with other forms in this answer.)
For the most part, there is no need for the first digit of f to be zero. If it were zero, we could remove it and shift the other digits left one position. This multiplies f by b, but we can compensate by decreasing the exponent e by one. For example, if the floating-point number were 0.111_2 · 2^7, we could instead represent it as 1.110_2 · 2^6. This opens up room for a new digit at the end of f, which gives us better precision than if the first digit is zero.
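As a rough sketch (plain Python; the function name and digit-list representation are my own, and it assumes at least one nonzero digit), that shift-and-compensate step can be written like this:

```python
def normalize(digits, exponent):
    # digits: binary digits of the significand, with the radix point
    # after the first digit, e.g. [0, 1, 1, 1] means 0.111.
    # Shifting the digits left one position multiplies the significand
    # by 2, so we compensate by decreasing the exponent by 1 each time.
    # (Assumes at least one digit is nonzero, and ignores the lower
    # bound on the exponent that real formats have.)
    digits = list(digits)
    while digits[0] == 0:
        digits.pop(0)
        digits.append(0)  # room opens up for one more digit at the end
        exponent -= 1
    return digits, exponent

# The example from the text: 0.111 * 2^7 becomes 1.110 * 2^6.
print(normalize([0, 1, 1, 1], 7))  # ([1, 1, 1, 0], 6)
```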
Thus, we prefer to have the first digit be non-zero. This is called normal form. A number with a zero first digit is denormalized. At the very “bottom” end of the representable numbers, e reaches its bound, and we cannot decrease it further, so we cannot shift the digits of f left, and we have to leave a zero in the first position. These numbers, which are necessarily denormalized, are subnormal numbers. Some floating-point formats support them. Some do not.
When the base is two, there are only two possible digit values (0 and 1), so, if the first digit is not zero, it must be one. Therefore, when we are using base two and a number is in normal form, we know its first digit is one, and there is no need to record this digit along with the others if we know the number is normal.
In IEEE-754 formats, the bits used to record the exponent also record whether the number is normal or not. The numbers s, e, and f above are stored in fields of bits we might call S, E, and F. However, they are not directly stored as binary integers in these bits. In an exponent field E, if the bits are all zeros, that indicates a subnormal number. This tells us both that the exponent e is at its lowest value and that the first digit of the significand is zero. If the bits are all ones, that indicates the item is an infinity or a NaN. All values of E between all zeros and all ones indicate values of e and also tell us that the first digit of the significand is one. (The values of E are not directly the values of e; there is a bias used to adjust, so that e = E − bias. For example, IEEE-754 binary32 has a bias of 127, so an E of 00000001 represents e = 1 − 127 = −126.)
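For concreteness, here is a small Python sketch (using the standard struct module; the function name decode_binary32 is my own) that extracts the S, E, and F fields from a binary32 value and applies the rules above:

```python
import struct

def decode_binary32(x):
    # Reinterpret the 32 bits of a binary32 value as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    S = bits >> 31            # 1 sign bit
    E = (bits >> 23) & 0xFF   # 8 exponent-field bits
    F = bits & 0x7FFFFF       # 23 significand-field bits
    if E == 0xFF:
        return "infinity or NaN"
    if E == 0:
        e, lead = 1 - 127, 0  # subnormal: lowest exponent, leading digit 0
    else:
        e, lead = E - 127, 1  # normal: leading 1 is implied by E, not stored in F
    f = lead + F / 2**23      # significand with its leading digit restored
    return (-1) ** S, e, f    # the value is s * 2^e * f

print(decode_binary32(1.0))   # (1, 0, 1.0)
print(decode_binary32(-6.5))  # (-1, 2, 1.625), i.e. -1.101_2 * 2^2
```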
Thus, it is incorrect to say we “throw away” the leftmost bit of the significand. What actually happens is that the leftmost bit is encoded in the exponent field, and the remaining bits are stored in the significand field.
It is also sloppy to say that handling the leftmost bit this way “doubles the accuracy.” It is true that a significand with p bits instead of p−1 bits has twice as many possible values, because each bit doubles the number of values representable. So encoding one bit via the exponent field gives us twice as many possible values as we would have if we had to use only the field F for the significand. But it is clearer to describe that as one more bit of accuracy rather than a doubling of accuracy.
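The counting argument in one line of arithmetic (the toy significand width p is my own choice for illustration):

```python
# With p significand bits there are 2**p possible bit patterns, so each
# added bit doubles the count. Encoding the leading bit via the exponent
# field buys exactly one extra bit -- "twice as many values," not twice
# the accuracy in any stronger sense.
p = 4
patterns_stored_only = 2 ** (p - 1)  # only the p-1 bits in field F
patterns_with_implied_bit = 2 ** p   # leading bit encoded via field E
print(patterns_stored_only, patterns_with_implied_bit)  # 8 16
```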