binary floating-point rounding ieee-754 divide

Please explain theorem 7 in Goldberg 91

Theorem 7

When β = 2, if m and n are integers with |m| < 2^(p - 1) and n has the special form n = 2^i + 2^j, then (m n) n = m, provided floating-point operations are exactly rounded.

If I post the whole proof here, it will be long, unreadable and ugly. So please click the link on the right, press ctrl + F, and find Theorem 7. There it is! Goldberg91

OK, I have to say, at least from my own perspective, the proof of theorem 7 is too weird to comprehend, although the author claims that it's ingenious.

What I can only understand is that m has at most 1 bit right of the binary point. Yes, I know that, but why so n*qbar will round to m consequently? I can't also understand the so-called "halfway case" and almost everything from that line on.

Any help is welcome, thank you in advance.

EDIT:

Interestingly, the first comment below has solved all my questions in a roll, and the second comment suggests me to narrow my post.

That's right. It's inhuman to ask a person to explain the whole proof. So now my question becomes this:

Why the initial unscaled m's low-order bit was 0(found in the paragraph below formula 9)? Shouldn't the most significant digit be zero instead of the least significant one? Does this have something to do with 'Big-endian' or 'Little-endian'?

Solution

First off, let's scale n by 2. n should be made greater than or equal to 2^p - 1, and less than 2^p. The scaled n will be donated by n'. Scaling won't make any difference. It is exponent that is modified, since numbers are in binary, and we just have to focus on the significand / mantissa.

Next, m is scaled, producing m', so that q' = m'/n', less than 1 and greater than 1/2 (In fact, I think there should be 1/2 < q' ≤ 1). Scaling like these are possible because the upper bounds are exactly double the lower bounds, and β = 2.

As we can see, 2^p - 1 ≤ n' < 2^p and 1/2 < q' < 1, while m' = q' * n'. Since q' and n' are positive, the maximum of m' is the product of n' and q''s maximums, that is 2^p * 1 = 2^p. Similarly, the minimum of m' is 2^p - 1 * 1/2 = 2^p - 2.

Now that 2^p - 2 < m' < 2^p, we can say p - 2 < log2(m') < p, so the number of digits to the right of the binary point will be either p-1 (this happens when log2(m') is between p-2 and p-1) or p (when log2(m') is greater than or equal to p-1). Thus, m' has at most one bit to the right of the binary point.

As Mark Dickinson has said, as a result, the difference between m' and the next precision-p float up/down is at least 1/2. So to show that a quantity will round to m', it's enough to show that it's within 1/4 of m'.

Besides that, "the halfway case", namely the case where that quantity is exactly 1/4 from m', worth a separate discussion: Since the initial unscaled m had |m| < 2*p - 1, it has at least one bit to the right of the binary point, due to the same reason as mentioned above. m is an integer, so all digits to the right of the binary point are zero. Certainly, its low-order bit is 0 for that reason. Because scaling doesn't have effect on the significand / mantissa, the low-order bit of m' is also 0.

Consequently, using round to even, which is adopted by the original author by writing "Throughout the rest of this paper, round to even will be used." above (You can use ctrl + F again to find it), m' + 1/4 (0.01 in binary) will round to m', since 0 is even.

That is, if q̄ = m n, to prove the theorem requires showing that

|n' * q̄ - m'| ≤ 1/4.

Update 1

q' is a rational number, and q' < 1, so we can suppose that q' = 0.q1 q2... in binary, where qi, where i= 1,2,3... are single digits that contains 0 or 1. Let q̂ = 0.q1 q2 ... qp 1. Be careful, this token is "q-hat", not "q-bar", and it's a new variable I've just introduced in.

Now, if we shift q̂ left by p + 1 digits, we'll get an integer, namely q1 q2 ... qp 1, because all bits are to the left of the binary point. I'll use N to donate this integer hereafter. As a result, |q̂ - q'| = |N / 2^(p + 1) - m' / n'|.

The lower-bit of N is 1, so we know N = 1 + (qp * 2^1 + qp-1 * 2^2 + ... + q1 * 2^p). Obviously, N is an odd integer. Originally, n = 2^i + 2^j. Since scaling a number is just multiplying or dividing it by 2, n' is still a sum of two exponents of 2. Let them be n' = 2^i' + 2^j'. For convenience, it's assumed that i' ≥ j'.

2^p - 1 ≤ n' < 2^p, so 2^p - 1 ≤ 2^i' + 2^j' < 2^p. As a result, i', which accounts for a higher proportion of n', must be p - 1. To make n' less than 2^p, j' mustn't equal to i'. For readability, let k = j', so k ≤ p - 2. Thus,

. (Replace all n by n', m by m')

I suggest you to use some scratch papers to verify this formula on your own.

Update 2

Take a look at the numerator |(2^(p - 1 - k) + 1) * N - 2^(p + 1 - k) * m'|. As we have proved, k ≤ p - 2, so p - 1 - k ≥ 1, ensuring 2^(p - 1 - k) and 2^(p + 1 - k) to be even. Both (2^(p - 1 - k) + 1) and N are odd, so (2^(p - 1 - k) + 1) * N is odd, while 2^(p + 1 - k) * m' is even, and an odd integer can't equal to an even one. Therefore, (2^(p - 1 - k) + 1) * N - 2^(p + 1 - k) * m' is a non-zero integer. It's absolute value, namely the numerator, is consequently guaranteed to be equal to or greater than 1. Hence,

|q̂ - q'| ≥ 1 / (n' * 2^(p + 1 - k)).

q' < 1, and q̄ < 1, so q' * q̄ < 1. As a result, (m' * q') * q̄ < m', that is, n' * q̄ < m. Consequently,

|n' * q̄ - m'|

= m' - n' * q̄

= n' * (q' - q̄)

/* Since q̄ only has a precision of p, it will be 0.q1 q2 ... qp. So q̄ = q̂ - 2^(- p - 1). */

= n' * {q' - [q̂ - 2^(- p - 1)]}

= n' * [q' - q̂ + 2^(- p - 1)]

/* Assume that q' < q̂. The case q > q̂ is not discussed. */

= n' * [- |q' - q̂| + 2^(- p - 1)]

/* Since |q̂ - q'| ≥ 1 / (n' * 2^(p + 1 - k)), - |q̂ - q'| ≤ - 1 / (n' * 2^(p + 1 - k)). So */

≤ n' * {- 1 / [n' * 2^(p + 1 - k)] + 2^(- p - 1)}

= n' * {2^(- p - 1) - 1 / [n' * 2^(p + 1 - k)]}

/* We know n' = 2^i' + 2^j' = 2^(p - 1) + 2^k */

= [2^(p - 1) + 2^k] * {2^(- p - 1) - 1 / {[2^(p - 1) + 2^k] * 2^(p + 1 - k)}}

/* The equation is becoming less as less readable. For brevity, insignificant algebraic steps are omitted. */

= 2^-2 + 2^(- p - 1 + k) - 1 / 2^(p + 1 - k)

= 1/4

By now, |n' * q̄ - m'| ≤ 1/4 have been established. As mentioned above, this proves the theorem.

Q.E.D.

Questions that still remain:

Why "the case q > q̂ is similar" ? I think things will be totally different without that minus!