Tags: floating-point, arm, precision, arbitrary-precision, fpu

Is it possible to implement IEEE 754 conforming floating-point arithmetic based on ARM pseudocode using finite precision floating-point arithmetic?


Background: floating-point arithmetic is usually implemented using integer arithmetic (for example, Berkeley SoftFloat). Per the ARM pseudocode [1], floating-point arithmetic is specified in terms of infinite-precision floating-point arithmetic (the type real).

My model of 32-bit floating-point arithmetic is written in C and is based on the ARM pseudocode. The type real is implemented using finite-precision floating-point arithmetic: 64-bit double, 80-bit long double (on x86_64), or 128-bit long double (on AArch64):

typedef double Real;
//typedef long double Real;

While testing it I've noticed some failures: most are related to missing Inexact and/or Underflow exceptions; in some cases the result is ±1 bit off.

Background: in contrast with integer-arithmetic-based implementations (which check whether certain bits are non-zero), the ARM pseudocode function FPRoundBase computes an error term:

// Get the unrounded mantissa as an integer, and the "units in last place" rounding error.
int_mant = RoundDown(mantissa * 2.0^F);  // < 2.0^F if biased_exp == 0, >= 2.0^F if not
error = mantissa * 2.0^F - Real(int_mant);

Raising the Inexact and/or Underflow exceptions depends on this error:

if !altfp && biased_exp == 0 && (error != 0.0 || trapped_UF) then
    if fpexc then FPProcessException(FPExc_Underflow, fpcr);
...
if error != 0.0 then
    if fpexc then FPProcessException(FPExc_Inexact, fpcr);
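
For concreteness, the error computation maps onto C with Real = double roughly like this (a simplified sketch, not the actual code of my model; altfp, trapped_UF and fpexc are omitted, and the function name is only for illustration):

#include <math.h>
#include <stdbool.h>
#include <stdint.h>

typedef double Real;

/* F is frac_size (23 for single precision). */
static void fp_round_error(Real mantissa, int F, int biased_exp,
                           bool *inexact, bool *underflow)
{
    Real    scaled   = mantissa * exp2((Real) F); /* mantissa * 2.0^F, rounded to double */
    int64_t int_mant = (int64_t) floor(scaled);   /* RoundDown */
    Real    error    = scaled - (Real) int_mant;  /* "units in last place" error */

    *underflow = (biased_exp == 0) && (error != 0.0);
    *inexact   = (error != 0.0);
}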

My problem: in some cases the error is zero while it is expected to be non-zero, leading to missing Inexact and/or Underflow exceptions. Note, however, that the numeric result is correct in these cases. Here is an example for x + y:

x                        -4.96411207e-35         0x8683f7ff
y                        -3.98828101             0xc07f3fff
x after FPUnpack         -4.9641120695506692e-35 0xb8d07effe0000000
y after FPUnpack         -3.9882810115814209     0xc00fe7ffe0000000
x+y                      -3.9882810115814209     0xc00fe7ffe0000000
=== FPRoundBase ===
op                       -3.9882810115814209     0xc00fe7ffe0000000 
exponent                 1
min_exp                  -126
biased_exp               128
int_mant                 16728063
mantissa                 1.9941405057907104      0x3fffe7ffe0000000
frac_size                23
error                    0                       0x0
===

Here we see that error is zero, while it is expected to be non-zero.

If we multiply 1.9941405057907104 by 2^23 we get 16728062.9999999995871232, which is rounded to 16728063, and 16728063 - 16728063 is 0.
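
Note that with the unpacked operands from the trace, the double addition itself already produces a value bit-identical to y, so every later step in FPRoundBase operates on an exactly representable value and error comes out as 0. A small standalone check (not part of the model):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* Bit patterns of x and y after FPUnpack, copied from the trace above. */
    uint64_t xb = 0xb8d07effe0000000ULL, yb = 0xc00fe7ffe0000000ULL;
    double x, y;
    memcpy(&x, &xb, sizeof x);
    memcpy(&y, &yb, sizeof y);

    double sum = x + y;   /* rounded to nearest double: the tiny x is absorbed */

    printf("sum == y : %d\n", sum == y);   /* prints 1 */
    printf("sum      = %.17g\n", sum);     /* -3.9882810115814209, same as y */
    return 0;
}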

I've tried locally increasing the precision when computing error: some failures were fixed, but new failures appeared. I've also tried some other "quirks and tweaks" with the same result: some failures fixed, new failures introduced.

Note that all the operations on Real (i.e. double) are done using FE_TONEAREST.

Finally, I've started thinking: is it possible to implement IEEE 754 conforming 32-bit (for example) floating-point arithmetic based on ARM pseudocode using finite precision floating-point arithmetic?

[1] Exploration Tools (section "Arm A64 Instruction Set Architecture", button "Download XML"), file ISA_A64_xml_A_profile-2023-03/ISA_A64_xml_A_profile-2023-03/xhtml/shared_pseudocode.html.


UPD0. I've noticed that 128-bit long double leads to 50% fewer failures than 64-bit double.

UPD1. "Error free" here means "IEEE 754 conforming"; the question has been updated to use "IEEE 754 conforming".


Solution

  • I started using GNU MPFR:

    typedef mpfr_t          Real;
    

    Tests show that:

    • it is possible to implement IEEE 754 conforming 32-bit (for example) floating-point arithmetic based on ARM pseudocode using finite precision floating-point arithmetic;
    • the minimum MPFR precision needed to reach the "IEEE 754 conforming" property differs for each FP operation. Examples to be added.

    UPD0: Using the tests above I've found the following minimum MPFR precisions (see the sketch at the end of this answer for the ADD case applied to the failing example):

    ADD     277
    SUB     277
    MUL     48
    DIV     48
    D2F     53
    FMA     426
    

    Note: because the tests are not exhaustive, these numbers may actually be higher. I cannot explain these numbers. I've found one correlation with the "Expand-and-Sum Solution" from Secure and Accurate Summation of Many Floating-Point Numbers:

Single: a single-precision floating-point number has 1 sign bit, an 8-bit exponent, and a 23-bit mantissa. Thus, representing this as an integer requires 1 + 2^8 + 23 = 280 bits.

    Notes:

1. Does this mean that, per the "Expand-and-Sum Solution", the FMA requires 280 * 2 = 560 bits?
2. As I understand it, the "Expand-and-Sum Solution" is not the same as the "ARM pseudocode solution": the former uses integer arithmetic, the latter uses floating-point arithmetic.
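
Finally, as referenced above, here is a minimal sketch of the MPFR-backed approach applied to the failing x + y example from the question, using the 277-bit ADD precision from the table (the from_bits helper and the direct error computation are my own illustration, not the pseudocode port itself; assumes MPFR >= 3.0, build with -lmpfr -lgmp):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <mpfr.h>

static float from_bits(uint32_t bits)   /* helper, for illustration only */
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void)
{
    const mpfr_prec_t prec = 277;   /* the minimum ADD precision from the table above */
    mpfr_t x, y, sum, scaled, error;
    mpfr_inits2(prec, x, y, sum, scaled, error, (mpfr_ptr) 0);

    /* The two single-precision operands of the failing x + y example. */
    mpfr_set_flt(x, from_bits(0x8683f7ffu), MPFR_RNDN);
    mpfr_set_flt(y, from_bits(0xc07f3fffu), MPFR_RNDN);

    /* 277 bits is more than enough to hold this particular sum exactly,
       so no information is lost before the rounding step. */
    mpfr_add(sum, x, y, MPFR_RNDN);

    /* FPRoundBase-style error term: mantissa * 2.0^F - RoundDown(...).
       From the trace: exponent = 1, frac_size = 23, so scale |sum| by 2^(23-1). */
    mpfr_abs(scaled, sum, MPFR_RNDN);
    mpfr_mul_2si(scaled, scaled, 22, MPFR_RNDN);
    mpfr_frac(error, scaled, MPFR_RNDN);

    mpfr_printf("error = %.10Re\n", error);   /* non-zero: Inexact is correctly detected */

    mpfr_clears(x, y, sum, scaled, error, (mpfr_ptr) 0);
    return 0;
}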