I'm writing a C program that tests operations on single-precision floating-point numbers. In particular, I want to know whether the bound on the absolute rounding error associated with x (a number within the normal range of binary32):
|round(x) - x| < machine-epsilon * 2^E
also holds for subnormal numbers. I've concluded that it does, but I couldn't find any secondary source to confirm that I'm right.
Is my conclusion correct?
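For reference, here is a minimal sketch of the kind of check I have in mind (assuming an implementation where float is IEEE-754 binary32 and double is binary64, and using a double as a stand-in for the real number x; the helper name and test value are just for illustration):

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

/* Check |round(x) - x| < FLT_EPSILON * 2^E for a nonzero, finite x,
   where E = max(-126, floor(log2(|x|))) and round(x) is the
   double-to-float conversion under the current rounding mode. */
static int error_within_bound(double x)
{
    float  rounded = (float) x;
    int    e       = (int) floor(log2(fabs(x)));
    int    E       = e < FLT_MIN_EXP - 1 ? FLT_MIN_EXP - 1 : e;  /* clamp to -126 */
    double bound   = FLT_EPSILON * ldexp(1.0, E);                /* epsilon * 2^E */
    return fabs((double) rounded - x) < bound;
}

int main(void)
{
    double x = 3.5 * 0x1p-149;   /* halfway between two adjacent subnormals */
    printf("bound holds: %d\n", error_within_bound(x));
    return 0;
}
```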
For subnormal results, the maximum potential absolute error from rounding a real number to the nearest representable IEEE-754 binary floating-point number is the same as the maximum potential absolute error for results in the smallest normal binade.
That is, both the smallest normal binade and the subnormal numbers have the same exponent in the floating-point representation, so the least significant bits of their significands have the same position value. (The “same exponent” referred to here is the mathematical exponent, −126 for the basic 32-bit format and −1022 for the basic 64-bit format. It is not the encoding of the exponent, which is 1 for the smallest normal binade and 0 for subnormals.) The position value of the least significant bit determines the maximum possible error. Since it is the same for subnormal values as it is for values in the smallest normal binade, they have the same maximum potential error.
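As an illustration (a sketch assuming float is IEEE-754 binary32), the spacing between consecutive representable values, and therefore the worst-case rounding error, is the same 2^−149 for a subnormal as it is for FLT_MIN, the start of the smallest normal binade:

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    float normal    = FLT_MIN;        /* 2^-126, smallest normal */
    float subnormal = FLT_MIN / 4;    /* 2^-128, a subnormal     */

    /* Distance to the next representable value above each point. */
    float d_normal    = nextafterf(normal, 2 * normal) - normal;
    float d_subnormal = nextafterf(subnormal, 2 * subnormal) - subnormal;

    printf("spacing at FLT_MIN:     %a\n", d_normal);     /* 0x1p-149 */
    printf("spacing at a subnormal: %a\n", d_subnormal);  /* 0x1p-149 */
    return 0;
}
```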
If x is a real number, E is the greater of the minimum exponent of the format and floor(log2(|x|)), round(x) is the result of converting x to the floating-point format, and 𝜀 is the machine epsilon (the unit of least precision for 1), then:
|round(x) − x| < 𝜀 2^E
for any rounding mode, and:
|round(x) − x| ≤ ½ 𝜀 2^E
for any round-to-nearest mode (such as round-to-nearest-ties-to-even).
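For example, under the default round-to-nearest-ties-to-even mode, a real number exactly halfway between two adjacent subnormals realizes the worst case: the error is exactly ½ 𝜀 2^−126 = 2^−150, consistent with the ≤ bound above (a sketch assuming float is binary32 and double is binary64):

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x        = 0x1.8p-149;    /* halfway between 1*2^-149 and 2*2^-149 */
    float  rounded  = (float) x;     /* ties-to-even picks 2*2^-149           */
    double err      = fabs((double) rounded - x);
    double half_ulp = 0.5 * FLT_EPSILON * ldexp(1.0, -126);   /* 2^-150 */

    printf("error    = %a\n", err);       /* 0x1p-150 */
    printf("half ulp = %a\n", half_ulp);  /* 0x1p-150 */
    return 0;
}
```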
For the directed rounding modes, such as round-toward-negative-infinity, the error satisfies either:
0 ≤ round(x) − x < 𝜀 2^E
or:
−𝜀 2^E < round(x) − x ≤ 0,
depending on the direction, of course. (Round-toward-zero does not satisfy either of the two relations above for all real numbers, but for any particular x it satisfies one of them, depending on the sign of x.)
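Here is a sketch of the directed-rounding cases, assuming the implementation defines FE_UPWARD and FE_DOWNWARD and honors the dynamic rounding mode in the double-to-float conversion (the volatile qualifier is there only to keep the conversion from being folded at compile time):

```c
#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double x = 0x1.4p-148;    /* between two adjacent subnormals */

    fesetround(FE_UPWARD);
    printf("upward:   error = %a\n", (double) (float) x - x);   /* +0x1p-150, >= 0 */

    fesetround(FE_DOWNWARD);
    printf("downward: error = %a\n", (double) (float) x - x);   /* -0x1p-150, <= 0 */

    fesetround(FE_TONEAREST);          /* restore the default mode */
    return 0;
}
```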