Background: usually floating-point arithmetic is implemented using integer arithmetic (for example, Berkeley SoftFloat). In the ARM pseudocode [1], floating-point arithmetic is implemented using infinite-precision floating-point arithmetic (type real).
My model of 32-bit floating-point arithmetic is written in C and is based on the ARM pseudocode. The type real is implemented using finite-precision floating-point arithmetic: 64-bit double, 80-bit long double (on x86_64), or 128-bit long double (on AArch64):
typedef double Real;
//typedef long double Real;
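For instance, the platform selection could be made at compile time (a hypothetical sketch mirroring the platforms mentioned above, not the model's actual configuration):

#if defined(__aarch64__)
typedef long double Real;   /* 128-bit IEEE binary128 long double on AArch64 */
#elif defined(__x86_64__)
typedef long double Real;   /* 80-bit x87 extended long double on x86_64 */
#else
typedef double Real;        /* 64-bit IEEE binary64 double elsewhere */
#endif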
While testing it I've noticed some failures, mostly related to missing Inexact and/or Underflow exceptions. In some cases the result is +/-1 bit off.
Background: in contrast with integer-arithmetic-based implementations (which check whether certain bits are non-zero), the ARM pseudocode function FPRoundBase computes error:
// Get the unrounded mantissa as an integer, and the "units in last place" rounding error.
int_mant = RoundDown(mantissa * 2.0^F); // < 2.0^F if biased_exp == 0, >= 2.0^F if not
error = mantissa * 2.0^F - Real(int_mant);
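In C this computation corresponds to something like the following (a minimal sketch, assuming Real is double; the function and variable names are mine, not the model's):

#include <math.h>
#include <stdint.h>

typedef double Real;

/* Sketch of the error computation above: scale the mantissa by 2^F,
   truncate to an integer, and keep the fractional remainder. */
static void fp_round_base_error(Real mantissa, int frac_size,
                                int64_t *int_mant, Real *error)
{
    Real scaled = ldexp(mantissa, frac_size);  /* mantissa * 2.0^F (exact scaling) */
    *int_mant = (int64_t)floor(scaled);        /* RoundDown(...) */
    *error = scaled - (Real)*int_mant;         /* "units in last place" error */
}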
Raising of the Inexact and/or Underflow exceptions depends on this error:
if !altfp && biased_exp == 0 && (error != 0.0 || trapped_UF) then
if fpexc then FPProcessException(FPExc_Underflow, fpcr);
...
if error != 0.0 then
if fpexc then FPProcessException(FPExc_Inexact, fpcr);
My problem: in some cases the error is zero while it is expected to be non-zero, leading to missing Inexact and/or Underflow exceptions. Note, however, that the numeric result is correct in these cases. Here is an example for x + y:
x -4.96411207e-35 0x8683f7ff
y -3.98828101 0xc07f3fff
x after FPUnpack -4.9641120695506692e-35 0xb8d07effe0000000
y after FPUnpack -3.9882810115814209 0xc00fe7ffe0000000
x+y -3.9882810115814209 0xc00fe7ffe0000000
=== FPRoundBase ===
op -3.9882810115814209 0xc00fe7ffe0000000
exponent 1
min_exp -126
biased_exp 128
int_mant 16728063
mantissa 1.9941405057907104 0x3fffe7ffe0000000
frac_size 23
error 0 0x0
===
Here we see that error is zero, while it is expected to be non-zero. If we multiply 1.9941405057907104 by 2^23 we get 16728062.9999999995871232, which double's round-to-nearest turns into 16728063, and 16728063 - 16728063 is 0.
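The effect is easy to reproduce in isolation (a self-contained demo, assuming Real is double; the values are the unpacked x and y from the trace above):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double x = -4.9641120695506692e-35;   /* x after FPUnpack */
    double y = -3.9882810115814209;       /* y after FPUnpack */
    double sum = x + y;                   /* already rounded to 53 bits here */

    int e;                                /* frexp exponent; stored exponent is e - 1 */
    double mantissa = 2.0 * frexp(fabs(sum), &e);   /* in [1.0, 2.0); e - 1 == 1 */

    double scaled = ldexp(mantissa, 23);            /* mantissa * 2.0^23 */
    int64_t int_mant = (int64_t)floor(scaled);
    double error = scaled - (double)int_mant;

    /* Prints "int_mant 16728063 error 0x0p+0": the tail that infinite
       precision would preserve lies below double's 53-bit significand,
       so error is exactly zero and Inexact goes missing. */
    printf("int_mant %lld error %a\n", (long long)int_mant, error);
    return 0;
}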
I've tried to locally increase the precision when computing error: some failures were fixed, new failures appeared. I've also tried some other quirks and tweaks, with the same result: some failures fixed, new failures appeared.
Note that all the operations on Real (i.e. double) are done using FE_TONEAREST.
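(As a cheap guard one can assert this at startup; a trivial sketch using the standard fenv interface:)

#include <assert.h>
#include <fenv.h>

/* The model assumes the host rounds all Real operations to nearest. */
static void assert_round_to_nearest(void)
{
    assert(fegetround() == FE_TONEAREST);
}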
Finally, I've started wondering: is it possible at all to implement IEEE 754 conforming 32-bit (for example) floating-point arithmetic based on the ARM pseudocode using finite-precision floating-point arithmetic?
[1] Exploration Tools (section "Arm A64 Instruction Set Architecture", button "Download XML"), file ISA_A64_xml_A_profile-2023-03/ISA_A64_xml_A_profile-2023-03/xhtml/shared_pseudocode.html.
UPD0. I've noticed that 128-bit long double leads to 50% fewer failures than 64-bit double.
UPD1. The "error free" means "IEEE 754 conforming". Changed to "IEEE 754 conforming".
I started using GNU MPFR:
typedef mpfr_t Real;
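For illustration, here is a minimal sketch of one operation under this typedef (the helper names and the explicit precision parameter are mine, assuming the standard MPFR C API):

#include <mpfr.h>

typedef mpfr_t Real;

/* Unpack a binary32 input into a Real of the chosen working precision;
   mpfr_set_flt is exact whenever prec >= 24. */
static void real_from_float(Real r, float f, mpfr_prec_t prec)
{
    mpfr_init2(r, prec);
    mpfr_set_flt(r, f, MPFR_RNDN);
}

/* One model operation, rounded to nearest as with FE_TONEAREST before. */
static void real_add(Real result, const Real a, const Real b)
{
    mpfr_add(result, a, b, MPFR_RNDN);
}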
Tests show the following minimum MPFR precisions (UPD0: found using the tests above):
ADD 277
SUB 277
MUL 48
DIV 48
D2F 53
FMA 426
Note: because the tests are not exhaustive, these numbers could be higher. I cannot explain these numbers. I've found one correlation with the "Expand-and-Sum Solution" from Secure and Accurate Summation of Many Floating-Point Numbers:
Single: a single-precision floating-point number has 1 sign bit, an 8-bit exponent, and a 23-bit mantissa. Thus, representing this as an integer requires 1 + 2^8 + 23 = 280 bits.
Notes: