Tags: c, floating-point, binary, bit, bit-shift

Converting a float to a 16 bit binary value in C


This is my code. When I set the mantissa, exponent, and sign widths to 23, 8, and 1 respectively to represent a 32-bit number, my code works. However, when I change the widths to 10, 5, and 1 to match a 16-bit format, it does not print the correct value.

#include <stdio.h>
#include <stdlib.h>
typedef union
{
  float val;
  struct
  {
    unsigned int mantissa: 10;
    unsigned int exponent: 5;
    unsigned int sign: 1;
  }float16;
 
}IEEE_FP;

void print(unsigned int val, int num_bits)
{
  for(int i = num_bits-1; i >= 0; i--)
    {
      if((val>>i)&1){printf("1");}
      else{printf("0");}
    }
}

void print16(IEEE_FP test)
{
  print(test.float16.sign, 1);
  print(test.float16.exponent, 5);
  print(test.float16.mantissa, 10);
}


int main()
{

  IEEE_FP test;
  test.val = 0.789;
  print16(test);
  return EXIT_SUCCESS;
}

I passed the float 0.789 and expected 0011101001010000, but my code prints 1111101111100111. Any ideas?


Solution

  • The reason the code works for a 32-bit float value is that, when a value is stored in the float val member of the union, it is stored using a representation for the 32-bit float type. (This type is the IEEE-754 binary32 format, also called “single precision” IEEE-754.) That representation is prepared by a combination of hardware and software parsing the 0.789 string in the source code and converting it to the binary32 format.

    The union works because the bit-fields in the struct member of the union use the same memory as the float val member of the union. Reading the values through the bit-fields interprets the bits as integers instead of interpreting them as a binary32. Then, having those values as integers, your program can print them. (Additionally, how bit-fields are laid out is largely implementation-defined. The C standard allows flexibility in the order and alignment of bit-fields with respect to the bytes in memory. So this code is not portable.)
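
    As an illustration of reading the same bits without depending on bit-field layout, here is a minimal sketch that copies the float into a uint32_t and separates the fields with shifts and masks. The helper names (float_bits, f32_sign, f32_exponent, f32_mantissa) are hypothetical, and the sketch assumes float is IEEE-754 binary32 with the same byte order as uint32_t.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bytes of a float into a 32-bit integer.
   Assumes float is IEEE-754 binary32 with the same byte order as uint32_t. */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Bit 31: sign. */
static unsigned f32_sign(float f)
{
    return (unsigned)(float_bits(f) >> 31);
}

/* Bits 30..23: biased exponent (bias 127). */
static unsigned f32_exponent(float f)
{
    return (unsigned)((float_bits(f) >> 23) & 0xFFu);
}

/* Bits 22..0: stored fraction bits of the significand. */
static uint32_t f32_mantissa(float f)
{
    return float_bits(f) & 0x7FFFFFu;
}
```

    Under those assumptions, 0.789f has the bit pattern 0x3F49FBE7: sign 0, biased exponent 126, and fraction bits 0x49FBE7.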

    When you change the bit-field definitions to have the widths for a 16-bit floating-point type, nothing changes about float val. The same bits for a binary32 are still in memory. Nothing automatically changes the bits in float val to be the bits of a 16-bit floating-point type instead of the bits for a binary32.
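
    To see where the unexpected output comes from, the sketch below extracts the 16 bits that the 10/5/1 bit-fields most likely overlay. It assumes a little-endian implementation that allocates bit-fields starting from the least significant bit (common with GCC and Clang on x86, but not guaranteed by the C standard). The function name low16_of_float is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the low-order 16 bits of a float's binary32
   representation. Assumption: on the asker's implementation the 16 bits of
   bit-fields occupy these low-order bits of the union's storage. */
static uint16_t low16_of_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* binary32 pattern, e.g. 0x3F49FBE7 */
    return (uint16_t)(bits & 0xFFFFu);
}
```

    For 0.789f the binary32 pattern is 0x3F49FBE7, so the low 16 bits are 0xFBE7, which is 1111101111100111 in binary. That matches the output reported in the question: the program printed the bottom 16 bits of the 32-bit significand, not a 16-bit floating-point encoding.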

    To show the bits that would represent the value in a 16-bit floating-point type, you largely have two choices:

    • Change float in float val to the name of a 16-bit floating-point type. This is not specified by the C standard, but your compiler might have a type for it.
    • Write software to calculate what the bits would be. This may require separating the sign of the number, calculating what exponent should be used, scaling the significand to normalize it, and rounding the bits of the significand to 11 bits.
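
    The second option can be sketched as follows. This is one possible approach under stated assumptions, not a definitive implementation: it assumes float is IEEE-754 binary32, targets the IEEE-754 binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits), and rounds to nearest with ties to even. The function names float_to_half_bits and bits16_to_string are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compute the binary16 bit pattern for a float. */
static uint16_t float_to_half_bits(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                 /* assumes binary32 float */

    uint16_t sign  = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t exp32 = (x >> 23) & 0xFFu;
    uint32_t man32 = x & 0x7FFFFFu;

    if (exp32 == 0xFFu)                       /* infinity or NaN */
        return (uint16_t)(sign | 0x7C00u | (man32 ? 0x0200u : 0u));

    int32_t e = (int32_t)exp32 - 127 + 15;    /* rebias the exponent */

    if (e >= 31)                              /* too large: infinity */
        return (uint16_t)(sign | 0x7C00u);

    if (e <= 0) {                             /* subnormal or zero in binary16 */
        if (e < -10)
            return sign;                      /* rounds to signed zero */
        man32 |= 0x800000u;                   /* make the leading 1 explicit */
        uint32_t shift = (uint32_t)(14 - e);  /* 14..24 */
        uint32_t m     = man32 >> shift;
        uint32_t rem   = man32 & ((1u << shift) - 1u);
        uint32_t half  = 1u << (shift - 1u);
        if (rem > half || (rem == half && (m & 1u)))
            m++;                              /* may carry into smallest normal */
        return (uint16_t)(sign | m);
    }

    /* Normal case: keep the top 10 fraction bits, round on the 13 dropped. */
    uint32_t h   = ((uint32_t)e << 10) | (man32 >> 13);
    uint32_t rem = man32 & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u)))
        h++;                                  /* carry may bump the exponent */
    return (uint16_t)(sign | h);
}

/* Hypothetical helper: write 16 bits as a string, most significant first. */
static void bits16_to_string(uint16_t h, char out[17])
{
    for (int i = 0; i < 16; i++)
        out[i] = ((h >> (15 - i)) & 1u) ? '1' : '0';
    out[16] = '\0';
}
```

    For 0.789f this sketch yields 0x3A50, which bits16_to_string renders as 0011101001010000, the value the question expected.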

    Further information on doing that depends on circumstances. If this is a school assignment, you should use information provided in prior lessons. If not, then how to proceed depends on why you are trying to do this, whether it needs to be fast or just educational, whether it needs to be portable or can be customized for a particular C implementation, and so on.