Let’s assume you have a float64_t
number with an arbitrary value and you want to find out if said number can safely be down-cast to a float32_t
with the restriction that the resulting rounding error must not exceed a given epsilon.
A possible implementation could look like this:
float64_t before = 1.234567890123456789;
float64_t epsilon = 0.000000001;
float32_t mid = (float32_t)before; // 1.2345678806304931640625
float64_t after = (float64_t)mid; // 1.2345678806304931640625
float64_t error = fabs(before - after); // 0.000000009492963526369635474111
bool success = error <= epsilon; // false
To make things more interesting though, let’s assume you’re not supposed to perform any actual type casting of the value at hand between those two types (as seen above).
And just to take it up a notch: Let’s assume you’re not casting down to float32_t
, but a floating-point type of arbitrary precision (8bit, 16bit, 32bit, or maybe even 24bit) specified by its bit count and exponent length (and following the conventions of IEEE 754 such as rounding ties to even).
So what I’m looking for is a generic algorithm more akin to this:
float64_t value = 1.234567890123456789;
float64_t epsilon = 0.000000001;
int bits = 16;
int exponent = 5;
bool success = here_be_dragons(value, epsilon, bits, exponent); // false
To give an example: down-casting the 64bit number 1.234567890123456789 to a lower precision results in the following rounding errors:
8bit: 0.015432109876543309567864525889
16bit: 0.000192890123456690432135474111
24bit: 0.000005474134355809567864525889
32bit: 0.000000009492963526369635474111
40bit: 0.000000000179737780214850317861
48bit: 0.000000000001476818667356383230
56bit: 0.000000000000001110223024625157
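For reference, rows of this table can be reproduced by rounding the significand in software instead of casting; a minimal sketch (the helper name is made up here, and it assumes the default round-to-nearest-even mode while ignoring the reduced exponent range of the target format):

#include <math.h>
#include <stdio.h>

// Round x to `sig` significant bits (stored mantissa bits plus the implicit
// leading bit), using the current rounding mode (round-to-nearest-even by default).
// The target format's exponent range is deliberately ignored here.
static double round_to_precision(double x, int sig) {
    int e;
    double m = frexp(x, &e);      // x = m * 2^e with 0.5 <= |m| < 1
    m = nearbyint(ldexp(m, sig)); // scale so the target's ulp is 1, then round
    return ldexp(m, e - sig);     // scale back
}

int main(void) {
    double value = 1.234567890123456789;
    // 16bit (5 exponent, 10 mantissa bits): 11 significant bits
    // 32bit (8 exponent, 23 mantissa bits): 24 significant bits
    printf("16bit: %.30f\n", fabs(value - round_to_precision(value, 11)));
    printf("32bit: %.30f\n", fabs(value - round_to_precision(value, 24)));
    return 0;
}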
What’s known:
The min and max values of each type (as those can be derived from the above).
The count of positive normal values of each type: ((2^exponent) - 2) * (2^mantissa)
The bias of each type: (2^(exponent - 1)) - 1
The value to be down-cast (provided in the given higher precision type).
The epsilon allowed for the down-cast to fall within in order to be considered successful (also provided in the given higher precision type).
(An approximation of the expected error might be enough, depending on its accuracy and deviation factors. Exact calculations preferred though, obviously.)
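Spelled out in code, purely for illustration, those derived quantities look like this (a small sketch, using IEEE 754 half precision as the example and taking mantissa = bits - exponent - 1):

#include <math.h>
#include <stdio.h>

int main(void) {
    int bits = 16, exponent = 5;        // e.g. IEEE 754 half precision
    int mantissa = bits - exponent - 1; // stored mantissa bits (1 bit goes to the sign)

    int bias = (1 << (exponent - 1)) - 1;                    // (2^(exponent - 1)) - 1
    long long normals = ((1LL << exponent) - 2) << mantissa; // ((2^exponent) - 2) * (2^mantissa)
    double min_normal = ldexp(1.0, 1 - bias);                // smallest positive normal value
    double max_value = (2.0 - ldexp(1.0, -mantissa)) * ldexp(1.0, bias); // largest finite value

    // For half precision: bias=15, normals=30720, min=6.10352e-05, max=65504
    printf("bias=%d normals=%lld min=%g max=%g\n", bias, normals, min_normal, max_value);
    return 0;
}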
Cases that need not be covered (as they are trivially solvable in isolation):
Values the target type can represent exactly: trivially true.
Values that fall outside the target type's min/max range: trivially false.
What I’ve thought up so far:
We know the count of positive normal values (excluding zero) in a given floating-point type and we know that the negative value space is symmetric to the positive one.
We also know that the distribution of discrete values within the value range (away from zero) follows an exponential function, and its relative epsilon a related step function.
It should be possible to calculate which nth
discrete normal value a given real value within the normal value range of a given floating-point type would fall onto (via some kind of logarithmic projection, or something?), shouldn’t it? Given this n
one should then be able to calculate the epsilon of the corresponding value from its step function and compare it against the specified max-error, no?
I have the feeling that this might actually be enough to calculate (or at least accurately estimate) the expected casting error. I just have no idea how to put these things together.
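To illustrate that idea, here is a minimal sketch of the half-ulp comparison (the helper name is made up; it uses the worst-case round-to-nearest error of the target format at the value's magnitude rather than the exact per-value error, assumes the value lands in the target's normal range, and skips the trivial cases excluded above):

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

// Conservative check: is the worst-case round-to-nearest error of a format with
// `bits` total bits and `exponent` exponent bits at `value`'s magnitude within
// `epsilon`? It may answer false even though the actual error for this particular
// value would have been small enough.
static bool within_epsilon_bound(double value, double epsilon, int bits, int exponent) {
    int mantissa = bits - exponent - 1;        // stored mantissa bits
    int e;
    frexp(value, &e);                          // 2^(e-1) <= |value| < 2^e
    double ulp = ldexp(1.0, e - 1 - mantissa); // spacing of the target format near value
    return 0.5 * ulp <= epsilon;               // half-ulp bound of round-to-nearest
}

int main(void) {
    double value = 1.234567890123456789, epsilon = 0.000000001;
    printf("%d\n", within_epsilon_bound(value, epsilon, 16, 5));  // 0
    printf("%d\n", within_epsilon_bound(value, epsilon, 64, 11)); // 1
    return 0;
}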
How would you approach this? (Bonus points for actual code :P)
Ps: To give a bit more context: I’m working on a var_float
implementation, and in order to figure out the smallest losslessly (or lossy within a given epsilon) convertible representation for a given value, I’m currently performing a binary search utilizing the aforementioned naive round-trip logic to find the right size. It works, but it’s lacking in the efficiency and coolness department. Even though it by no means is a performance bottleneck (yada yada premature optimization yada yada), I’m curious as to whether one can find a more math-grounded and elegant solution. ;)
Something like the following may work:
int isbad(double x, double releps) {
    double y = x * (1 + 0x1.0p29); // splitting constant: 1 + 2^(52-23)
    double z = y - x - y + x;      // the "little half": x minus its float32-precision "big half"
    return !(fabs(z/x) < releps);
}
This uses a trick (due to Dekker, I believe) for splitting a floating-point number into a "big half" and a "little half" that sum exactly to the original number. I want the "big half" to have 23 bits and the "little half" to have the rest, so I split using the constant 1 + 2^(52-23).
Caveats: You need to handle the more limited exponent range by checking against upper and lower bounds. Subnormals (especially the case where the result is subnormal in the little type but not the big type) need different special handling. I wrote !(fabs(z/x) < releps)
instead of fabs(z/x) <= releps
because I want NaNs to qualify as "bad." releps
is a bad name for that variable, since the threshold is actually half an ulp larger than the number you specify when using round-to-nearest.
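For the generic case in the question (arbitrary bit and exponent widths), the same trick can presumably be generalized by deriving the splitting constant from the target's mantissa width; a sketch along those lines (the function name is made up, and all the caveats above — exponent range, subnormals, the half-ulp threshold — are still unhandled):

#include <math.h>
#include <stdio.h>

// Same splitting trick as above, but with the constant 1 + 2^(52 - mantissa)
// computed from the target format. Assumes x is normal in both formats.
static int isbad_generic(double x, double releps, int bits, int exponent) {
    int mantissa = bits - exponent - 1;           // stored mantissa bits of the target
    double split = 1 + ldexp(1.0, 52 - mantissa); // 1 + 2^(52 - mantissa)
    double y = x * split;
    double z = y - x - y + x; // "little half" = rounding error of x at the target precision
    return !(fabs(z/x) < releps);
}

int main(void) {
    double value = 1.234567890123456789;
    printf("%d\n", isbad_generic(value, 0.000000001, 32, 8));  // 1: error exceeds releps
    printf("%d\n", isbad_generic(value, 0.000000001, 48, 11)); // 0: fits
    return 0;
}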