I am a beginner going through assembly basics. While reading the material, I came across this paragraph, which explains how floating point numbers are stored in memory:
The exponent for a float is an 8 bit field. To allow large numbers or small numbers to be stored, the exponent is interpreted as positive or negative. The actual exponent is the value of the 8 bit field minus 127. 127 is the "exponent bias" for 32 bit floating point numbers. The fraction field of a float holds a small surprise. Since 0.0 is defined as all bits set to 0, there is no need to worry about representing 0.0 as an exponent field equal to 127 and fraction field set to all 0's. All other numbers have at least one 1 bit, so the IEEE 754 format uses an implicit 1 bit to save space. So if the fraction field is 00000000000000000000000, it is interpreted as 1.00000000000000000000000. This allows the fraction field to be effectively 24 bits. This is a clever trick made possible by making exponent fields of 0x00 and 0xFF special.
I am not getting it at all.
Can you explain to me how they are stored in memory? I don't need references, just a good explanation that I can easily understand.
Floating point numbers follow the IEEE 754 standard. This set of rules is used mainly because it lets floating point numbers be (relatively) easily compared to integers and to other floating point numbers.
There are 2 common versions of floating point: 32-bit (IEEE binary32, aka single-precision float) and 64-bit (binary64, aka double precision). The only difference between them is the size of their fields: binary32 has an 8-bit exponent and a 23-bit fraction, while binary64 has an 11-bit exponent and a 52-bit fraction.
There's an additional bit, the sign bit, that specifies if the considered number is positive or negative.
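To make those fields concrete, here is a small C sketch (just an illustration, not the only way to do it) that copies a float's raw bits into an integer and masks out the sign, exponent and fraction. It uses 12.375, the same value worked through by hand just below:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = 12.375f;

        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);            /* copy the raw bit pattern */

        uint32_t sign     = bits >> 31;            /* 1 bit */
        uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
        uint32_t fraction = bits & 0x7FFFFF;       /* 23 bits, implicit leading 1 not stored */

        printf("bits     = 0x%08X\n", (unsigned)bits);      /* 0x41460000 */
        printf("sign     = %u\n", (unsigned)sign);          /* 0 */
        printf("exponent = %u (actual exponent %d)\n",
               (unsigned)exponent, (int)exponent - 127);    /* 130 (actual exponent 3) */
        printf("fraction = 0x%06X\n", (unsigned)fraction);  /* 0x460000 */
        return 0;
    }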
Now, take for example 12.375 in base 10, stored as a 32-bit float:
The first step is to convert this number to base 2. It's pretty easy: 12 is 1100 (8 + 4) and 0.375 is .011 (0.25 + 0.125), so you get 1100.011.
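If you want a mechanical recipe for the fractional part, here is a tiny C sketch (my own illustration): repeatedly double the fraction, and the integer digit that pops out each time is the next bit after the binary point.

    #include <stdio.h>

    int main(void) {
        double frac = 0.375;              /* the fractional part of 12.375 */

        printf("0.375 in base 2 = .");
        for (int i = 0; i < 12 && frac != 0.0; i++) {
            frac *= 2.0;                  /* shift the binary point one place right */
            int bit = (int)frac;          /* the digit that crossed the point: 0 or 1 */
            printf("%d", bit);
            frac -= bit;
        }
        printf("\n");                     /* prints: 0.375 in base 2 = .011 */
        return 0;
    }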
Next you have to move the point until you get 1.100011 (that is, until the only digit before the point is a 1). How many times do we move the point? 3 times, and that is the exponent. It means that our number can be represented as 1.100011 * 2^3. (It's not called a decimal point, because this is binary; it's a "radix point" or "binary point".)
Moving the point around (and counting those moves with the exponent) until the mantissa starts with a leading 1 is called "normalizing". A number that's too small to be represented that way (because of the limited range of the exponent) is called a subnormal or denormal number.
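As a quick check of both ideas, the C standard library exposes them directly (a small sketch; compile with -lm): ilogb() reports the exponent a value gets after normalization, and fpclassify() tells you when a value is too small to be normalized.

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void) {
        /* ilogb() returns the unbiased binary exponent of the normalized value. */
        printf("ilogb(12.375) = %d\n", ilogb(12.375));       /* 3 */

        /* Anything smaller than FLT_MIN can't be normalized in binary32:
           it is stored as a subnormal, with no implicit leading 1. */
        float tiny = FLT_MIN / 4.0f;
        printf("FLT_MIN/4 is %s\n",
               fpclassify(tiny) == FP_SUBNORMAL ? "subnormal" : "normal");
        return 0;
    }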
After that we have to add the bias to the exponent; that's 127 for the 8-bit exponent field in 32-bit floats. Why do we do this? Because this way we can more easily compare floating point numbers with integers: comparing FP bit patterns as integers tells you which one has the larger magnitude, as long as they have the same sign. Also, incrementing the bit pattern (including carry from the mantissa into the exponent) increases the magnitude to the next representable value, which is what nextafter() computes.
If we didn't do this, a negative exponent would be represented using two's complement notation, essentially putting a 1 in the most significant bit; but then a float with a negative exponent would look greater, as a bit pattern, than one with a positive exponent. For this reason we just add 127: with this little "trick", all positive exponents start from 10000000 base 2 (biased 128, i.e. actual exponent +1), while negative exponents are at most 01111110 base 2 (biased 126, i.e. actual exponent -1).
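Here's a small C sketch (again just an illustration; compile with -lm) demonstrating both claims: for two positive floats, comparing the raw bit patterns as unsigned integers gives the same answer as comparing the floats, and adding 1 to a bit pattern lands on the same value as nextafterf().

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    static uint32_t bits_of(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);      /* grab the raw IEEE 754 bit pattern */
        return u;
    }

    int main(void) {
        float a = 0.5f, b = 3.0f;

        /* Same ordering whether we compare as floats or as bit patterns. */
        printf("%d %d\n", a < b, bits_of(a) < bits_of(b));     /* 1 1 */

        /* Incrementing the bit pattern gives the next representable float. */
        uint32_t u = bits_of(a) + 1;
        float next;
        memcpy(&next, &u, sizeof next);
        printf("%d\n", next == nextafterf(a, INFINITY));       /* 1 */
        return 0;
    }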
In our example the exponent is 3, so the encoded exponent is 3 + 127 = 130, which is 10000010 in base 2.
The last thing to do is to put the mantissa bits (100011, padded with zeros out to 23 bits) after the exponent. The result is:

    0 10000010 10001100000000000000000
    ^ |--exp-| |------mantissa-------|
    (first bit is the sign bit)

i.e. 01000001010001100000000000000000, or 0x41460000 in hex.
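Going the other way, here is a short C sketch (my illustration) that assembles those three fields by hand and reinterprets the result as a float, to confirm it really is 12.375:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t sign     = 0;          /* positive */
        uint32_t exponent = 0x82;       /* 130 = 3 + 127, i.e. 10000010 */
        uint32_t fraction = 0x460000;   /* 10001100000000000000000 */

        uint32_t bits = (sign << 31) | (exponent << 23) | fraction;

        float f;
        memcpy(&f, &bits, sizeof f);    /* reinterpret the bits as a float */
        printf("0x%08X -> %f\n", (unsigned)bits, f);   /* 0x41460000 -> 12.375000 */
        return 0;
    }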
There's a nice online converter that visualizes the bits of a 32-bit float, and shows the decimal number it represents. You can modify either and it updates the other. https://www.h-schmidt.net/FloatConverter/IEEE754.html
That was the simple version, which is a good start. It simplifies things by leaving out the special cases: the all-zeros and all-ones exponent encodings (0x00 and 0xFF) used for zero, subnormals, infinity and NaN.
The Wikipedia articles on single and double precision are excellent, with diagrams and lots of explanation of corner cases. See them for the complete details.
Also, some (mostly historical) computers use FP formats that aren't IEEE-754.
And there are other IEEE-754 formats, like 16-bit half-precision, and one notable extended-precision format is 80-bit x87 which stores the leading 1 of the significand explicitly, instead of implied by a zero or non-zero exponent.
IEEE-754 even defines some decimal floating-point formats, using 10^exp to exactly represent decimal fractions instead of binary fractions. (HW support for these is limited but does exist).