float.h-like definitions for IEEE 754 binary16 half floats


I'm using half floats as implemented in the SoftFloat library (read: 100% IEEE 754 compliant), and, for the sake of completeness, I wish to provide my code with definitions equivalent to those available in <float.h> for float, double, and long double.

I know there are different flavours of half floats, but I'm only interested in the one standardized by IEEE 754, known as binary16.

From my research and my tests, I'm confident in defining some of the constants as follows:

#define HALF_MANT_DIG      11
#define HALF_DIG           3
#define HALF_DECIMAL_DIG   5
#define HALF_EPSILON       UINT16_C(0x1400) /* 0.00097656 */
#define HALF_MIN           UINT16_C(0x0400) /* 0.00006103515625 */
#define HALF_MAX           UINT16_C(0x7BFF) /* 65504.0 */

NOTE: epsilon, min, and max are defined as the raw hexadecimal representation of the 16 bits taken by the type. The proper way of assigning the raw value to the type depends on the half float library used.

However, for the exponent-related definitions, I wasn't able to find consensus. I have taken a look at the Wikipedia page for binary16, at this other SO question, at the Half library, and at several other pieces of code on GitHub and elsewhere.

The proposal linked from that other SO question sounds reputable to me, as does the Half library, and the good news is that they match. However, I found disagreement in the FP16.java implementation, in this implementation, in the Zig language implementation, and in Sargon for D.

#define HALF_MIN_EXP     The article and Half say (-13) but FP16.java and sargon say (-14) 
#define HALF_MAX_EXP     The article and Half say 16 but others say 14 or 15
#define HALF_MIN_10_EXP  The article and Half say (-4) but sargon says (-5)
#define HALF_MAX_10_EXP  The article and Half say 4 but sargon says 5

I'd suppose the article and Half are the sources most likely to be right, but how can I know for sure which values are correct for IEEE 754 binary16?


Solution

  • #define HALF_MANT_DIG 11

    Yes, the binary16 format has 11 significant digits (bits). (10 are stored in the primary significand field and 1 is encoded via the exponent field.)

    #define HALF_DIG 3

    I do not have a reference at hand, so no comment. But this could be tested without too much difficulty.

    #define HALF_DECIMAL_DIG 5

    IEEE 754-2019 gives this as 1 + ceiling(p × log10(2)), where p is the “number of significant bits” in the format, hence 11, so 1 + ceiling(11 × 0.3010299957) = 1 + ceiling(3.3113…) = 1 + 4 = 5.
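Both digit counts can be checked with integer arithmetic. The sketch below assumes the C 2018 5.2.4.2.2 formula for *_DIG with a radix that is not a power of 10, floor((p − 1) × log10(2)), and the 1 + ceiling(p × log10(2)) formula quoted above. It rewrites each as a comparison between powers of 10 and powers of 2 so that no floating-point rounding is involved. The names HALF_P, half_dig, and half_decimal_dig are hypothetical and not part of any library.

```c
/* Significand precision of binary16 in bits, including the implicit bit. */
#define HALF_P 11

/* floor((p - 1) * log10(2)): the largest q such that 10^q <= 2^(p - 1). */
int half_dig(void)
{
    unsigned long limit = 1UL << (HALF_P - 1);   /* 2^10 = 1024 */
    unsigned long power = 1;                     /* 10^q */
    int q = 0;
    while (power * 10 <= limit) {
        power *= 10;
        q++;
    }
    return q;
}

/* 1 + ceiling(p * log10(2)): one more than the smallest n with 10^n >= 2^p. */
int half_decimal_dig(void)
{
    unsigned long limit = 1UL << HALF_P;         /* 2^11 = 2048 */
    unsigned long power = 1;                     /* 10^n */
    int n = 0;
    while (power < limit) {
        power *= 10;
        n++;
    }
    return 1 + n;
}
```

Under these assumptions, half_dig() returns 3 and half_decimal_dig() returns 5, which agrees with the proposed HALF_DIG and HALF_DECIMAL_DIG.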

    #define HALF_EPSILON UINT16_C(0x1400) /* 0.00097656 */

    Yes, with 11 significand bits, 1 is represented with a high bit of 2^0 and a low bit of 2^−10, which is 0.0009765625. With an exponent bias of 15, the exponent −10 is encoded as 5 in the exponent field, which sits above the 10 stored significand bits, so 5 << 10, which is 0x1400.

    #define HALF_MIN UINT16_C(0x0400) /* 0.00006103515625 */

    Yes, the minimum normal exponent encoding is 1, and removing the bias gives −14, so the value is 2^−14 = 0.00006103515625, and 1 in the exponent field (1 << 10) gives 0x0400.

    #define HALF_MAX UINT16_C(0x7BFF) /* 65504.0 */

    Yes, the maximum normal exponent field is 30, 30 << 10 gives 0x7800, the maximum significand field is 1111111111 in binary = 0x3FF, and combining them gives 0x7BFF. Removing the exponent bias of 15 gives 15, so the value represented is 2^15 × (2 − 2^−10) = 65,504.
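The three raw encodings can be cross-checked by decoding them by hand. The following is a minimal sketch of such a decoder, assuming the standard binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 stored significand bits). The function name half_bits_to_double is hypothetical, and this is not how SoftFloat or any other library performs the conversion.

```c
#include <math.h>
#include <stdint.h>

/* Decode a raw binary16 bit pattern into a double (exact for all inputs,
   since every binary16 value is representable as a double). */
double half_bits_to_double(uint16_t bits)
{
    int sign      = (bits >> 15) & 0x1;
    int exp_field = (bits >> 10) & 0x1F;   /* biased exponent, 0..31 */
    int frac      = bits & 0x3FF;          /* 10 stored significand bits */
    double magnitude;

    if (exp_field == 31) {
        /* All-ones exponent: infinity if the fraction is zero, else NaN. */
        magnitude = frac ? NAN : INFINITY;
    } else if (exp_field == 0) {
        /* Zero or subnormal: 0.frac * 2^-14 = frac * 2^-24. */
        magnitude = ldexp((double)frac, -24);
    } else {
        /* Normal: 1.frac * 2^(exp_field - 15) = (1024 + frac) * 2^(exp_field - 25). */
        magnitude = ldexp((double)(1024 + frac), exp_field - 25);
    }
    return sign ? -magnitude : magnitude;
}
```

With this decoder, 0x1400 yields 0.0009765625, 0x0400 yields 0.00006103515625, and 0x7BFF yields 65504.0.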

    #define HALF_MIN_EXP The article and Half say (-13) but FP16.java and sargon say (-14)
    #define HALF_MAX_EXP The article and Half say 16 but others say 14 or 15

    C defines the floating-point representation to have the significand digits starting after the radix point, instead of having one before the radix point and the rest after. That is, for a floating-point format with base b, the significand is in [1/b, 1) instead of [1, b). This is visible in the values of *_MIN_EXP and *_MAX_EXP and the behavior of the frexp function, and the exponents are off by one from the more common definition used in IEEE 754.

    Per IEEE 754, the exponent range is [−14, 15], so, for the C standard’s scaling, it is [−13, 16]: HALF_MIN_EXP is −13 and HALF_MAX_EXP is 16.
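The off-by-one between the two conventions can be observed directly with frexp, which returns a fraction in [0.5, 1). The sketch below uses two hypothetical helper names, c_style_exponent and ieee_style_exponent, and is only meaningful for nonzero finite inputs.

```c
#include <math.h>

/* Exponent in the C standard's scaling: x = f * 2^e with |f| in [0.5, 1). */
int c_style_exponent(double x)
{
    int e;
    (void)frexp(x, &e);   /* the fraction itself is not needed here */
    return e;
}

/* Exponent in IEEE 754's scaling: x = m * 2^e with |m| in [1, 2). */
int ieee_style_exponent(double x)
{
    return c_style_exponent(x) - 1;
}
```

Applied to the binary16 limits, c_style_exponent(65504.0) is 16 and c_style_exponent(0.00006103515625) is −13, while ieee_style_exponent gives 15 and −14 for the same inputs.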

    #define HALF_MIN_10_EXP The article and Half say (-4) but sargon says (-5)

    C 2018 5.2.4.2.2 12 says this is ⌈log10(b^(e_min − 1))⌉, where e_min is HALF_MIN_EXP, so we have ⌈log10(2^(−13 − 1))⌉ = ⌈−4.2144…⌉ = −4. And we know from HALF_MIN above that 10^−4 is in the normal range and 10^−5 is not, so −4 is the “minimum negative integer such that 10 raised to that power is in the range of normalized floating-point numbers,” which is also in 5.2.4.2.2 12.

    #define HALF_MAX_10_EXP The article and Half say 4 but sargon says 5

    As above, the C standard gives this as ⌊log10((1 − b^−p) × b^(e_max))⌋ = ⌊log10((1 − 2^−11) × 2^16)⌋ = ⌊log10(65,504)⌋ = ⌊4.8162…⌋ = 4, and 10^4 is below HALF_MAX but 10^5 is not.
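The two decimal exponents can also be found from the verbal definitions rather than the logarithm formulas, by stepping through powers of 10 until they leave the range. This is a sketch under the assumption that the smallest normal value is 2^−14 and the largest finite value is 65504. The macro and function names are hypothetical.

```c
#define HALF_MIN_VALUE 0.00006103515625   /* 2^-14, smallest positive normal */
#define HALF_MAX_VALUE 65504.0            /* (2 - 2^-10) * 2^15, largest finite */

/* Most negative e such that 10^e is still a normal binary16 magnitude. */
int half_min_10_exp(void)
{
    double power = 1.0;
    int e = 0;
    while (power / 10.0 >= HALF_MIN_VALUE) {
        power /= 10.0;
        e--;
    }
    return e;
}

/* Largest e such that 10^e does not exceed the largest finite binary16 value. */
int half_max_10_exp(void)
{
    double power = 1.0;
    int e = 0;
    while (power * 10.0 <= HALF_MAX_VALUE) {
        power *= 10.0;
        e++;
    }
    return e;
}
```

The loops stop at −4 and 4 respectively, which agrees with the values derived from the C standard's formulas. The comparisons are done in double, where the powers of 10 involved are far enough from the limits that rounding does not affect the result.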