floating-point, decimal, single-precision

Convert from large decimal number into floating point representation


I think I know how to convert a decimal number into IEEE 754 single-precision floating-point representation, but I want to make sure.

I want to represent 3.398860921 x 10^18 in IEEE 754 single-precision floating-point representation. I know how float rep. is broken down.

Bit 31: sign (0 for + and 1 for -)
Bits 30-23: the exponent
Bits 22-0: the mantissa (significand)

So the sign is obviously 0, since it's a positive number. For the exponent, I added 18 to 127 (the bias) and came up with: 1001 0001

For the mantissa which would be the 3.398860921 part, I continuously multiplied everything to the right of the decimal by 2, and if it was greater than 1 I marked a 1, otherwise a 0. Then took the new answer and again multiplied everything to the right of the decimal by 2, until I came up with enough bits to fill the mantissa.

So now I have: 0 | 1001 0001 | 0110 0110 0001 1011 1011 111

So when I convert this into hex, I get 0x48B30DDF, but that is a different number from the 3.398860921 x 10^18 I began with.

Is that supposed to be like that or did I make a mistake somewhere? Any help would be greatly appreciated.


Solution

  • You cannot use the decimal exponent for the IEEE 754 representation. IEEE 754 expects a binary exponent, that is, the number p when the value is written as 1.xxx… × 2^p.

    Nor can you take the mantissa from the decimal scientific notation and convert it directly to binary: it only makes sense in relation to the decimal exponent, which you cannot use directly.

    The algorithm is to convert the entire number to binary and then, and only then, take the 23 bits that follow the leading 1 bit as the significand. For the exponent, take the position of the leading bit (counting from 0 at the units place) and add the bias of 127.

    For your particular value of 3.398860921 x 10^18, the binary representation is 1.0111100101011001011011111111111101101001010010111101 × 2^61 according to Wolfram Alpha.

    This means that the unbiased exponent is 61 and a tentative significand, with the leading bit omitted, is 01111001010110010110111. You can compute the error of the conversion from decimal to floating-point as 0.0000000000000000000000011111111101101001010010111101 × 2^61. Since this error is larger than half the ULP, you should, unless you have reasons to prefer rounding downwards, add one to the significand to obtain the single-precision value nearest to the original decimal number. That gives the significand 01111001010110010111000. With the biased exponent 61 + 127 = 188 = 1011 1100, the full encoding is 0 | 1011 1100 | 0111 1001 0101 1001 0111 000, or 0x5E3CACB8.
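The steps above can be sketched in Python. This is only an illustration, not a full IEEE 754 encoder: the helper name `float32_bits` is made up for this example, and it assumes a positive integer input whose result is a normal number (no zero, subnormal, overflow, or negative handling). It rounds to nearest, with ties going to the even significand.

```python
import struct


def float32_bits(n: int) -> int:
    """Encode a positive integer as an IEEE 754 single-precision bit pattern.

    Hypothetical helper for illustration; assumes the result is a normal number.
    """
    exponent = n.bit_length() - 1      # position of the leading 1 bit (unbiased exponent)
    shift = exponent - 23              # number of bits that do not fit after the leading bit
    if shift > 0:
        significand = n >> shift       # leading bit plus the next 23 bits
        remainder = n & ((1 << shift) - 1)   # the discarded bits (the conversion error)
        half = 1 << (shift - 1)        # half a ULP, in the same units as remainder
        # Round to nearest; on an exact tie, round to the even significand.
        if remainder > half or (remainder == half and significand & 1):
            significand += 1
            if significand == 1 << 24:  # rounding carried into a new leading bit
                significand >>= 1
                exponent += 1
    else:
        significand = n << -shift      # small values fit exactly; pad with zeros
    fraction = significand & 0x7FFFFF  # drop the implicit leading bit
    return ((exponent + 127) << 23) | fraction   # sign bit is 0 for positive input


value = 3398860921 * 10**9             # 3.398860921 x 10^18 as an exact integer
bits = float32_bits(value)
print(f"{bits:#010x}")

# Cross-check against the platform's own double-to-single conversion.
reference = struct.unpack(">I", struct.pack(">f", float(value)))[0]
print(f"{reference:#010x}")
```

Both lines should print 0x5e3cacb8 for this value. The cross-check goes through a double first, which could in principle round twice for other inputs; here the discarded bits are far from the halfway point, so it does not matter.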