floating-point, type-conversion, comparison, endianness, ieee-754

Packing an IEEE-754 16-bit float to a 16-bit unsigned integer while preserving order


I have an IEEE-754 16-bit float that I'd like to losslessly pack as a 16-bit unsigned integer. The easiest way, of course, is to just pack its bytes and then unpack them, but the snag is that I need to compare the 16-bit integers afterwards (i.e. greater than, less than, etc.) in my program. So I'm looking for an isomorphism between f16 and u16 that preserves order. Could anyone suggest an algorithm that does this? Thanks!


Solution

  • To maintain <, ==, > of a float16 with integer math, treat the data as if it were a signed integer encoded using sign-magnitude.

    Do this with float and (u)int32_t first to get the code right (float16_t is not widely available), and then adjust for 16-bit.

    For negative values (sign bit set), negate the bit pattern as an unsigned integer; for non-negative values, set the MSBit.

    Make certain +0.0 and -0.0 convert to the same value.

    #include <float.h>
    #include <math.h>    // INFINITY
    #include <stdint.h>
    #include <stdio.h>
    
    // Assumes same endian for FP and integers
    unsigned float_to_sequence(float f) {
      union {
        float f;
        int32_t i;
        uint32_t u;
      } x = {.f = f};
      if (x.i < 0) {
        // Sign bit set: negate so larger magnitudes map to smaller values.
        // -0.0 (0x80000000) negates to itself.
        x.u = -x.u;
      } else {
        // Sign bit clear: set the MSBit so these sort above all negatives.
        x.u |= 0x80000000;
      }
      return x.u;
    }
    

    Test

    void test(float f) {
      printf("%+-20a %+-18.9e ", f, f);
      printf("0x%08X\n", float_to_sequence(f));
    }
    
    int main(void) {
      float f[] = {-INFINITY, -FLT_MAX, -1.0, -FLT_TRUE_MIN, -0.0, //
          0.0, FLT_TRUE_MIN, 1.0, FLT_MAX, INFINITY};
      size_t n = sizeof f / sizeof f[0];
      for (size_t i = 0; i < n; i++) {
        test(f[i]);
      }
    }
    

    Output

    -inf                 -inf               0x00800000
    -0x1.fffffep+127     -3.402823466e+38   0x00800001
    -0x1p+0              -1.000000000e+00   0x40800000
    -0x1p-149            -1.401298464e-45   0x7FFFFFFF
    -0x0p+0              -0.000000000e+00   0x80000000
    +0x0p+0              +0.000000000e+00   0x80000000
    +0x1p-149            +1.401298464e-45   0x80000001
    +0x1p+0              +1.000000000e+00   0xBF800000
    +0x1.fffffep+127     +3.402823466e+38   0xFF7FFFFF
    +inf                 +inf               0xFF800000
    

    The conversion is one-to-one, except that +0.0 and -0.0 both convert to the same value, as they should.

    For a 16-bit one-liner, where x holds the float16's raw bits as a uint16_t: uint16_t y = (x & 0x8000) ? -x : (x | 0x8000);
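
As a rough sketch of how the 16-bit case might look as functions, assuming the float16 is already available as its raw bit pattern in a uint16_t: the names f16_bits_to_key and f16_key_to_bits are hypothetical, and the inverse is an addition that is not part of the answer above. Because +0.0 and -0.0 share one key, the inverse can only return +0.0 for it.

```c
#include <stdint.h>

// Hypothetical helper: x is the raw IEEE-754 binary16 bit pattern.
// Returns a key whose unsigned order matches the numeric order of the
// floats. +0.0 and -0.0 both map to 0x8000.
uint16_t f16_bits_to_key(uint16_t x) {
  if (x & 0x8000u) {
    // Sign bit set: negate modulo 2^16 so larger magnitudes get smaller keys.
    return (uint16_t)(0u - x);
  }
  // Sign bit clear: set the top bit so these sort above all negatives.
  return (uint16_t)(x | 0x8000u);
}

// Hypothetical inverse: recovers the bit pattern from a key.
// Key 0x8000 comes back as +0.0, since the sign of zero was not kept.
uint16_t f16_key_to_bits(uint16_t key) {
  if (key & 0x8000u) {
    // Top bit set: the value was non-negative, so just clear the bit.
    return (uint16_t)(key & 0x7FFFu);
  }
  // Top bit clear: the value was negative, so undo the negation.
  return (uint16_t)(0u - key);
}
```

With these, two float16 values can be ordered by comparing f16_bits_to_key(a) < f16_bits_to_key(b) as plain unsigned integers. NaN bit patterns are not given any special treatment here.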