floating-point double bit-representation

Representation of double numbers


In an 8-bit representation, we know that the number 4 is stored as 00000100, and the number -4 is stored as 11111100. But how is the number 4.6 stored in a double?


Solution

  • Note: the question could be more specific about whether you want to know what format a particular programming language or system uses to represent doubles. This would help me narrow my answer and discard irrelevant segments.

    That being said, here is my answer:

    The format you describe for representing 4 and -4 is called two’s complement. Its highest-order bit indicates the sign (0 for non-negative, 1 for negative), so both positive and negative integers fit in the same fixed number of bits.
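    As a quick check of the two patterns from the question, here is a minimal Python sketch. The helper name `twos_complement_bits` is made up for this illustration and is not a standard library function.

```python
def twos_complement_bits(value: int, width: int = 8) -> str:
    """Return the two's-complement bit pattern of `value` in `width` bits."""
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {width} bits")
    # Masking with 2**width - 1 maps a negative value onto its
    # two's-complement pattern and leaves non-negative values unchanged.
    return format(value & ((1 << width) - 1), f"0{width}b")


print(twos_complement_bits(4))   # 00000100
print(twos_complement_bits(-4))  # 11111100
```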

    Floating point numbers are commonly stored in the IEEE-754 format, a separate format from that of integers and other “whole” numbers.

    The format essentially separates the binary representation into three segments: sign, exponent, and fraction.

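    The sign is a single bit, representing either positive (0) or negative (1). The other two segments vary in size with the precision (8 and 23 bits in a 32-bit float, 11 and 52 bits in a 64-bit double), and together they work much like scientific notation, if you are familiar with that system.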

    Let’s assume we have decided on using 32 bits to represent a fractional number. One bit is reserved for sign, so we have 31 bits to store the actual value of the number.

    Sign  Exponent  Fraction
    0     00000000  00000000000000000000000

    For the exponent, we want both positive and negative exponents so that we can represent really large and really small numbers. The IEEE-754 standard could have used the familiar two's complement system you describe to store these exponents, but its authors opted for a different system. Instead we identify a bias, which is 2^(number of bits in the exponent segment - 1) - 1. If we are using 8 bits for the exponent segment as in my example, the bias is 2^7 - 1, or 127.

    The exponents of all 1s and of all 0s are both reserved. Therefore, the lowest and highest exponents we can represent with this system are -126 and 127, respectively.

    Let’s say you want to represent 1.4 * 2^2. 2 is your exponent.
    Our bias is 127, so you store the exponent as 2+127, or 129.
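
    The bias arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, chosen only for this example; the 8-bit and 11-bit widths are the exponent sizes of 32-bit floats and 64-bit doubles.

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for an exponent segment of the given width: 2^(bits - 1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


def stored_exponent(true_exponent: int, exponent_bits: int = 8) -> str:
    """Return the biased exponent as a bit string."""
    biased = true_exponent + exponent_bias(exponent_bits)
    # The all-zeros and all-ones patterns are reserved, so only
    # 1 .. 2^bits - 2 are available for ordinary numbers.
    if not 1 <= biased <= 2 ** exponent_bits - 2:
        raise ValueError(f"exponent {true_exponent} is out of range")
    return format(biased, f"0{exponent_bits}b")


print(exponent_bias(8))    # 127
print(exponent_bias(11))   # 1023
print(stored_exponent(2))  # 10000001, which is 129
```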

    Now for the fraction. The fractional component of a number must be greater than or equal to 0 and strictly less than 1. Stick with me here, and consider how decimal numbers work.

    1.2 = 1 + 2/10 = 1*10^0 + 2*10^-1
    0.0147 = 0/10 + 1/100 + 4/1000 + 7/10000 = 0*10^-1 + 1*10^-2 + 4*10^-3 + 7*10^-4

    The trend here is that a decimal number can be decomposed into a sum of its digits multiplied by successive powers (as you move away from the .) of the base of the number system used to write it.

    Now consider this number:
    0.01101
    It is written in a representation called a "binary fraction". In much the same way as before, this number can be written as a sum whose denominators are successively higher powers of 2, the base, as we move away from the point:

    0.01101 = 0/2 + 1/4 + 1/8 + 0/16 + 1/32 = 0*2^-1 + 1*2^-2 + 1*2^-3 + 0*2^-4 + 1*2^-5
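
    To make the positional sum concrete, here is a small Python sketch that evaluates a binary-point string exactly. `binary_fraction_value` is a name invented for this illustration.

```python
from fractions import Fraction


def binary_fraction_value(digits: str) -> Fraction:
    """Evaluate a binary-point string such as '0.01101' exactly."""
    whole, _, frac = digits.partition(".")
    value = Fraction(int(whole, 2)) if whole else Fraction(0)
    for position, bit in enumerate(frac, start=1):
        if bit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {bit!r}")
        # The digit at `position` places after the point is worth bit / 2^position.
        value += Fraction(int(bit), 2 ** position)
    return value


print(binary_fraction_value("0.01101"))  # 13/32
print(binary_fraction_value("1.1"))      # 3/2
```

    13/32 is 0.40625, which matches 1/4 + 1/8 + 1/32.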

    Now that I have described how binary point numbers work, let's use them in our representation of floating point numbers.

    To fill the fraction segment, write the value you wish to represent as a binary fraction, then normalize it: shift the binary point until the number lies in the range [1,2), and record the number of shifts as the exponent. The leading 1 is implied, so only the part after the point, a value in [0,1), is stored.

    Example:
    34.25 (= 3*10^1 + 4*10^0 + 2*10^-1 + 5*10^-2 = 3 * 10 + 4 * 1 + 2 / 10 + 5 / 100 = 137/4)
    Convert to binary point:
    100010.01 (= 1*2^5 + 0*2^4 + 0*2^3 + 0*2^2 + 1*2^1 + 0*2^0 + 0*2^-1 + 1*2^-2 = 32 + 2 + 1/4 = 137/4)
    Normalize to the range [1,2):
    1.0001001 * 2^5
    This is the number that will be stored in our floating point format.

    The sign: 0 (for positive)
    The exponent: 5 + the bias, 127 = 132 = 10000100
    The fraction: 1.0001001 - 1 = .0001001 (remove point, add trailing zeros to fill segment) = 00010010000000000000000

    So our complete floating point representation for 34.25 is the following:
    0 10000100 00010010000000000000000
    without spacing:
    01000010000010010000000000000000
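
    The encoding steps for 34.25 can be retraced with a short Python sketch. `encode_float32` is a hypothetical helper written for this answer; it only handles positive values that fit exactly in 23 fraction bits, and it performs no rounding. The `struct` call at the end asks Python for the real 32-bit encoding as a cross-check.

```python
import struct
from fractions import Fraction


def encode_float32(value: float) -> str:
    """Encode a positive, exactly representable value as 'sign exponent fraction'."""
    x = Fraction(value)
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    exponent = 0
    # Normalize into [1, 2), counting the shifts as the exponent.
    while x >= 2:
        x /= 2
        exponent += 1
    while x < 1:
        x *= 2
        exponent -= 1
    if not -126 <= exponent <= 127:
        raise ValueError("exponent out of range for this sketch")
    # Drop the implied leading 1 and scale the rest to 23 bits.
    scaled = (x - 1) * 2 ** 23
    if scaled.denominator != 1:
        raise ValueError("value needs rounding, which this sketch does not do")
    return f"0 {exponent + 127:08b} {int(scaled):023b}"


print(encode_float32(34.25))  # 0 10000100 00010010000000000000000

# Cross-check against the machine's own 32-bit encoding.
(raw,) = struct.unpack(">I", struct.pack(">f", 34.25))
print(format(raw, "032b"))    # 01000010000010010000000000000000
```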

    So to extract our value, perform the following operation:
    (-1)^sign * (1 + fraction) * 2^(exponent - bias)
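
    Here is a minimal sketch of that extraction in Python, applied to the bit pattern derived for 34.25. The name `decode_float32` is an assumption of this example, and the reserved all-zeros and all-ones exponents are deliberately left unhandled.

```python
def decode_float32(bits: str) -> float:
    """Decode a 32-bit pattern (spaces allowed) for an ordinary, non-reserved value."""
    bits = bits.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = int(bits[0], 2)
    exponent = int(bits[1:9], 2)
    # The 23 stored bits are the digits after the binary point.
    fraction = int(bits[9:], 2) / 2 ** 23
    if exponent in (0, 255):
        raise ValueError("reserved exponent: not handled in this sketch")
    bias = 127
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - bias)


print(decode_float32("0 10000100 00010010000000000000000"))  # 34.25
```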

    A benefit of this representation is that special values such as infinity and NaN (“Not a Number”) can be represented as well, using those reserved exponents.
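
    To see the reserved exponent in use, the following Python sketch splits the 32-bit encodings of infinity and NaN into their three fields. `float32_fields` is a helper invented for this example; it relies only on the standard `struct` module.

```python
import struct


def float32_fields(value: float) -> tuple:
    """Return (sign, exponent, fraction) of the 32-bit encoding, as integers."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    return raw >> 31, (raw >> 23) & 0xFF, raw & 0x7FFFFF


print(float32_fields(float("inf")))   # (0, 255, 0)
print(float32_fields(float("-inf")))  # (1, 255, 0)

# NaN also uses the all-ones exponent, but with a non-zero fraction.
sign, exponent, fraction = float32_fields(float("nan"))
print(exponent == 255 and fraction != 0)  # True
```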

    You can find more details by looking into the IEEE-754 standard.

    The reality is that you can store a number any way you like, because the bits only mean what your programs decide they mean. The standard way to store floating point numbers, though, is IEEE-754.
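
    To connect this back to the 4.6 in the question: an IEEE-754 double uses the same three-segment layout with 1 sign bit, 11 exponent bits, and 52 fraction bits, so the bias is 2^10 - 1 = 1023. The sketch below assumes Python floats are IEEE-754 doubles, which holds on essentially every current platform; `double_fields` is a name made up for this example.

```python
import struct


def double_fields(value: float) -> str:
    """Return the 64-bit encoding of `value` as 'sign exponent fraction'."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    bits = format(raw, "064b")
    return f"{bits[0]} {bits[1:12]} {bits[12:]}"


print(double_fields(4.6))
```

    Since 4.6 = 1.15 * 2^2, the exponent field holds 2 + 1023 = 1025 (10000000001), and the fraction field holds the binary expansion of 0.15, which repeats forever and is cut off at 52 bits.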

    Downsides of the standard representation include:
    1. Lossy representation
    2. Arithmetic inaccuracy
    3. Effectively divides the representable exponential range in half with the sign bit (other types avoid this by having signed and unsigned versions)
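
    The first two downsides are easy to observe. This Python sketch uses `Decimal` to reveal the exact value a double actually stores; the helper `is_exact` is hypothetical, written only for this illustration.

```python
from decimal import Decimal


def is_exact(text: str) -> bool:
    """True if the decimal literal `text` is stored in a double without error."""
    return Decimal(float(text)) == Decimal(text)


print(is_exact("34.25"))  # True: 34.25 is a finite sum of powers of 2
print(is_exact("4.6"))    # False: 4.6 has an infinite binary expansion

# Lossy storage carries over into arithmetic.
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004
```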

    So it is not always desirable to go with standard binary representations.