Please confirm my understanding of IEEE floats.
I believe that if I have two numbers:
a = 2^ea × ma        (here × is the "mathematical ×")
b = 2^eb × mb

Then their product a * b is bit-identical to ((a * p) * b) / p, where p is a power of two (caveat: the intermediate values don't overflow to infinity or lose precision around exponent -308; that is, it's a p that satisfies (a * p) / p = a).
According to my understanding, a * b (IEEE *) is defined to be "the closest float to the mathematical value 2^(ea+eb) × ma × mb" (closest, or whatever the rounding method does). That involves multiplying the mantissas, then shifting the result as necessary (e.g. 1.11111 × 1.11111 = 11.something, so the final result will have exponent ea+eb+1), and rounding the mantissa to the number of bits in the float.
All of that is basically independent of the values of ea and eb; we can increase/decrease them however we want, and then undo it afterwards, and it won’t affect the final product.
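For example, I'd expect a check along these lines to pass (the specific values and the memcmp-based helper are just my illustration):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compare two doubles bit for bit. */
static int bits_equal(double x, double y) {
    return memcmp(&x, &y, sizeof x) == 0;
}

int main(void) {
    double a = 1.1e30, b = -3.7e-12;
    double p = 0x1p10;              /* 2^10, an exact power of two */

    double direct = a * b;
    double scaled = ((a * p) * b) / p;

    /* Expected to hold as long as a*p neither overflows nor underflows. */
    assert(bits_equal(direct, scaled));
    printf("bit-identical: %d\n", bits_equal(direct, scaled));
    return 0;
}
```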
Corollary: when multiplying a long list of doubles, I can strip off the exponents and accumulate them separately in an int. Then I can take the product of the doubles reduced to the range ±[1, 2) and finally bit-twiddle the exponent back in (and error if it's out of range ±1024). This would help cases like [1E300, 1E300, 1E-200, 1E-200] where a naive product would fail. And for cases where the naive product would work, the bit-twiddling version gives identical results.

TLDR: You are correct.
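A rough sketch of the corollary from your question, using the standard frexp/ldexp functions instead of raw bit-twiddling (the function name, the renormalization step, and the lack of zero/NaN handling are my own simplifications):

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Product of an array of doubles with the exponents accumulated
 * separately, so huge or tiny intermediate products cannot overflow
 * or underflow. */
static double scaled_product(const double *xs, size_t n) {
    double mantissa = 1.0;   /* kept close to 1 at all times        */
    long exponent = 0;       /* running sum of the stripped exponents */

    for (size_t i = 0; i < n; ++i) {
        int e;
        mantissa *= frexp(xs[i], &e);   /* frexp returns a value in ±[0.5, 1) */
        exponent += e;

        /* Renormalize the running mantissa so it stays near 1. */
        int f;
        mantissa = frexp(mantissa, &f);
        exponent += f;
    }
    /* ldexp overflows to ±inf / underflows toward 0 if the final
     * exponent is out of range, mirroring "error if out of range". */
    return ldexp(mantissa, (int)exponent);
}

int main(void) {
    double xs[] = { 1e300, 1e300, 1e-200, 1e-200 };
    printf("%g\n", scaled_product(xs, 4));   /* prints 1e+200, no intermediate overflow */
    return 0;
}
```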
Some assumptions first. Every normalized floating-point number is encoded as 1.x * 2^y with a binary mantissa 1.x and an integer exponent y. An integer power of 2 is therefore encoded as 1.0 * 2^p. If we disregard the bit shifts for normalization, a multiplication is implemented as
(1.x * 2^y) * (1.u * 2^w) == (1.x * 1.u) * 2^(y+w)
Conversely, division is (1.x / 1.u) * 2^(y-w). Following this pattern, a multiplication with a power of two contains the mantissa multiplication 1.x * 1.0, which is the identity function and cannot introduce issues with associativity. The addition y + p for the exponent is an integer addition, which is associative. Therefore, multiplication and division by a power of two are associative.
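As a concrete illustration (my own example, assuming an IEEE 754 binary64 double), scaling by a power of two leaves the 52 mantissa bits untouched and only changes the exponent field:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Extract the 52 mantissa (significand) bits of an IEEE 754 double. */
static uint64_t mantissa_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits & ((UINT64_C(1) << 52) - 1);
}

int main(void) {
    double x = 3.141592653589793;
    /* Multiplying or dividing by 2^20 changes only the exponent field;
       the mantissa bits are identical. */
    assert(mantissa_bits(x) == mantissa_bits(x * 0x1p20));
    assert(mantissa_bits(x) == mantissa_bits(x / 0x1p20));
    return 0;
}
```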
The argument above breaks down if an intermediate result's exponent falls outside the representable range: the result then either overflows to infinity or underflows to a denormal, where bits of the mantissa are rounded off.
This means that you can replace any (relatively expensive) division by a power of 2 with a multiplication by its inverse without loss of precision. For compile-time constants, compilers will do this automatically.
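For example (my own illustration): since 0.125 is an exactly representable power of two, dividing by 8.0 and multiplying by 0.125 give identical doubles:

```c
#include <assert.h>

int main(void) {
    /* 1/8 is a power of two, so it is exactly representable and
       x / 8.0 rounds to the same double as x * 0.125. */
    double xs[] = { 3.0, 1e-300, 7.123456789e17, -0.1 };
    for (int i = 0; i < 4; ++i)
        assert(xs[i] / 8.0 == xs[i] * 0.125);
    return 0;
}
```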
It is often useful to normalize numbers. For example, when computing the L2 vector norm (sqrt(x^2 + y^2 + …)) it is useful to determine the scaling factor s = max(abs({x, y, …})) and compute s * sqrt((x/s)^2 + (y/s)^2 + …) to avoid over- or underflowing the intermediate squaring.
If we instead round s to a nearby power of 2, no rounding errors are introduced and the division can be turned into a multiplication. The standard math functions frexp and ldexp can be used to pick the factor.
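A rough sketch of such a scaled norm (the function name and the exact way the scale is picked are my own choices, not a reference implementation):

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* L2 norm with power-of-two scaling: the intermediate squares stay near 1,
 * and because the scale factor is a power of two, applying and undoing it
 * introduces no rounding error.  Sketch only: zeros, NaNs and infinities
 * are not handled. */
static double scaled_norm(const double *xs, size_t n) {
    double biggest = 0.0;
    for (size_t i = 0; i < n; ++i)
        if (fabs(xs[i]) > biggest)
            biggest = fabs(xs[i]);

    int e;
    frexp(biggest, &e);                 /* biggest == m * 2^e with m in [0.5, 1) */
    double inv_scale = ldexp(1.0, -e);  /* 2^-e, an exact power of two */

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double t = xs[i] * inv_scale;   /* exact: power-of-two scaling */
        sum += t * t;
    }
    return ldexp(sqrt(sum), e);         /* undo the scaling; this step is exact */
}

int main(void) {
    double xs[] = { 3e200, 4e200 };     /* naive x^2 + y^2 would overflow */
    printf("%g\n", scaled_norm(xs, 2)); /* prints 5e+200 */
    return 0;
}
```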
Goldberg's "What every computer scientist should know about floating-point arithmetic" contains Theorem 7, which is related. It also contains the simple note "Scaling by a power of 2 is harmless, since it changes only the exponent, not the significand," which supports our assessment.