floating-point ieee-754

Can someone explain a conversion from decimal to IEEE 754 binary32?


I'm trying to convert the number 170.3 into an IEEE 754 binary32 float:

You can see my working from the images below:

Converting 170 into Binary

So 170 in Binary is 10101010

Converting 0.3 into Binary

We can see that the pattern 1001 repeats forever, so we have something like

0.3 = 0.0(1001), where the part in parentheses is recurring

Putting these together

When we put these numbers together, we can get the binary representation of the whole value:

170.3 = 10101010.0(1001)

where the part in parentheses is recurring.

Converting this into Standard Form

170.3 = 1.01010100(1001) x 2⁷, with the recurring part in parentheses

How this should get stored:

This is how our 4 bytes (32 bits) are allocated:

  1. The sign is 0, since we are working with a positive number
  2. The exponent is 127 + 7 = 134 which in binary is 10000110
  3. The fraction is then filled in by the first 23 bits of our recurring binary fraction, which in this case is 01010100100110011001100 (the recurring 1001 pattern starts after the leading 01010100)

So we can combine these together to get the binary data to store into our 4 bytes (32 bits):

01000011001010100100110011001100

Which, when split into 4 bytes should be:

01000011-00101010-01001100-11001100

I then run this C++ program, which stores the float and prints the memory:

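#include <cstddef>
#include <cstdio>
#include <iostream>

/* Prints the bits of an object, highest-addressed byte first
   (most significant byte first on a little-endian host) */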
static void print_bytes(const void *object, size_t size){
    #ifdef __cplusplus
    const unsigned char * const bytes = static_cast<const unsigned char *>(object);
    #else // __cplusplus
    const unsigned char * const bytes = object;
    #endif // __cplusplus

    size_t i;

    printf("[-");
    for(i = 0; i < size; i++)
    {
        //printf(bytes[i]);
        int binary[8];
        for(int n = 0; n < 8; n++){
            binary[7-n] = (bytes[size -1 - i] >> n) & 1;
        }
        /* print result */
        for(int n = 0; n < 8; n++){
            printf("%d", binary[n]);
        }
        printf("%c", '-');
    }
    printf("]\n\n");
}

int main () {

    std::cout << "\nStoring a Float in Memory";
    std::cout << "\n----------------------------\n\n";

    float height = 170.3f;

    std::cout << "Address is "<< &height << "\n\n";
    std::cout << "Size is "<<  sizeof(height) << " bytes\n\n";
    std::cout << "Value is " <<  height << "\n\n";

    std::cout << "Memory Blocks : \n";
    print_bytes(&height, sizeof(height));

    return 0;
}

But in the output, I can see that the last bit is a 1 and not a 0 as per my calculations:

Online converters also give a 1 as the last bit:

Could someone please explain to me where I went wrong in my calculation?


Solution

  • Could someone please explain to me where I went wrong in my calculation?

    OP did not properly account for rounding.

    Conversion typically rounds to the nearest representable value (round to nearest, ties to even) instead of truncating.

     12345678 9012345678901234
    +10101010.                           170
            0.0100110011001100 1 1001... 0.3
    +10101010.0100110011001100 1 1001... Sum
                               v vvvvvvv
                               1 |       extra bit past the 24
                                 1       "or" of the rest of the bits
    +10101010.0100110011001100 1 1       Value prior to rounding
    ^                        ^ ^ ^       These 4 bits & rounding mode determine round value
    +                        1           Round value to add (assume round to nearest, ties to even)
    +10101010.0100110011001101           Sum
    + 0101010.0100110011001101           23-bit portion explicitly stored.
    

    Amend the algorithm to also compute 1) one more bit, the "24th" bit (counting from the 0th bit), and 2) the "or" of all the less significant bits (25th, 26th, etc.).

    From these 2 bits, the least significant bit, sign bit and rounding mode, the proper rounding value can be determined.