Search code examples
armieee-754half-precision-float

float16_t rounding on ARM NEON


I am implementing emulation of ARM float16_t for X64 using SSE; the idea is to have bit-exact values on both platforms. I mostly finished the implementation, except for one thing, I cannot correctly emulate fma operation on float16_t for one particular set of input values. Here is it (this example is for ARM because it reproduces both on ARM and X64):

#include <arm_fp16.h>
#include <iostream>

#pragma GCC diagnostic ignored "-Wfloat-equal"

float fma_explicit(float a, float b, float c) {
    float result;
    asm volatile(
        "fmadd %s0, %s1, %s2, %s3" // result = a * b + c
        : "=w"(result)             // output
        : "w"(a), "w"(b), "w"(c)   // inputs
    );
    return result;
}

int main() {
    float16_t x = 56.84375f;
    float16_t y = 17.90625f;
    float16_t z = 0.07940673828125f;

    float16_t res = vfmah_f16(x, y, z);
    float16_t res_emulated = static_cast<float16_t>(fma_explicit(y, z, x));

    std::cout << "Float16 result         " << res << "\n";
    std::cout << "Float32 full precision " << fma_explicit(y, z, x) << "\n";
    std::cout << "Float32 emulated       " << res_emulated << "\n";

    if (res == res_emulated) {
      std::cout << "They are same\n";
    } else {
      std::cout << "They are different\n";
    }
}

The idea is quite simple: I emulate float16_t by working with 32-bit floats. To emulate vfmah_f16, I use fmadd on floats, and in the last step, I convert the result back to float16_t. For the particular combination of input values, the program prints:

Float16 result         58.2813
Float32 full precision 58.2656
Float32 emulated       58.25
They are different

I don't understand if this is hardware issue on ARM or expected behavior. To run the above program, you need an ARMv8.2 chip with fp16 support.


Solution

  • Using binary for the significand and apostrophes to mark the ends of binary16 and binary32 significands, the exact value of 17.90625•0.07940673828125 + 56.84375 is: 1.1101001000'1000000000000'12•25.

    As you can see, this is above the halfway-point between 1.11010010002•25 and 1.11010010012•25, and therefore rounding to binary16 should produce the greater value, 1.11010010012•25 = 58.28125.

    However, when we round to binary32, it is rounded at the second apostrophe. The number is halfway between two adjacent binary32 values, and ties round to the even low digit, so the result is 1.1101001000'10000000000002•25. Now this value is exactly halfway between the two binary16 values, 1.11010010002•25 and 1.11010010012•25. So, when this binary32 value is rounded to binary16, we round to the even low digit, yielding 1.11010010002•25 = 58.25.

    This shows that implementing a binary16 FMA by performing a binary32 FMA and rounding to binary16 is incorrect. The binary32 FMA loses information that is needed to produce the correct result.