I'm using libfixmath for a simulation that needs to run on two devices (iOS / Android) at the same time and be perfectly accurate.
The simulation needs to be fed some initial float parameters. I'm wondering if it's safe to use floats and then convert them into fix16_t in the way below (the function is from the library), or if I need to feed the simulation fix16_t values already?
In other words: is it possible that two different devices calculate different results for the same input to the function below, because of floating-point inaccuracy?
typedef int32_t fix16_t;
static const fix16_t fix16_one = 0x00010000; /*!< fix16_t value of 1 */
static inline fix16_t fix16_from_float(float a)
{
float temp = a * fix16_one;
// rounding
temp += (temp >= 0) ? 0.5f : -0.5f;
return (fix16_t)temp;
}
Assuming that:

- both machines use IEEE-754 single-precision floating point representation for float, and
- the value a is "reasonable",

the conversion should be portable, with the possible exception of the case where the absolute value of a is just slightly less than 0.5×2^−16.
Multiplying a (binary) floating point number by a power of 2 (in this case 2^16) is precise provided that it doesn't cause a floating point overflow (or underflow in the case of negative powers of 2). Every floating point implementation should handle that multiplication in precisely the same way.
The C++ standard requires conversion from floating point to integer types to truncate towards 0, so the rounding strategy is correct.
Adding 0.5 to temp will produce the correct result in almost all cases. For intermediate values of temp, the result will be precise.
If temp is 2^23 or more, its ULP is at least 1.0, so temp is already an exact integer. Adding 0.5 then produces a tie, which either leaves temp unchanged or moves it to the adjacent even integer; in both cases there is no fraction to be rounded, so the end result will be predictable as long as there is no overflow when casting back to an integer.
If temp is less than 1.0, the sum will be imprecise, because the exponent will be increased. The addition should then round to produce the correct result. Here, the only case of interest is where the truncated sum might be either 0 or 1; if temp is not close to 0.5, the sum cannot be 1.0 and the truncated sum must be 0. If temp is at least 0.5, the sum must be at least 1.0, and the truncated sum must be 1.
But if temp is just slightly less than 0.5, rounding of the sum may be significant. In particular, if temp is precisely 0.5−2^−25, there is an ambiguity. The result of the sum will be 1.0−2^−25, but this value is not precisely representable as an IEEE-754 single-precision float. Moreover, the error term is precisely one-half of a ULP. So the result needs to be rounded, and that will obey the rounding mode of the implementation.
The default rounding mode for IEEE-754 is "banker's rounding" (round half to even), where a value exactly halfway between two representable results rounds to whichever of the two has a 0 as its low-order significand bit. That will favour rounding 0.5−2^−25 + 0.5 to 1.0, which will produce the incorrect integer truncation 1. However, it is possible that a given implementation uses a different rounding mode, perhaps because it has been set using std::fesetround.
All of the above applies equally to negative values.