floating-point binary ieee-754

binary floating point addition algorithm


I'm trying to understand IEEE 754 floating point addition at a binary level. I have followed some example algorithms that I have found online, and a good number of test cases match against a proven software implementation. My algorithm is only dealing with positive numbers at the moment. However, I am not getting a match with this test case:

00001000111100110110010010011100 (1.46487e-33)
00000000000011000111111010000100 (1.14741e-39)

I split each number into sign bit, exponent, and mantissa, and add the implicit 1 back to the mantissa:

0 00010001 1.11100110110010010011100
0 00000000 1.00011000111111010000100

I subtract the smaller exponent from the larger in order to determine the realignment-shift amount:

 00010001 (17)
-00000000 (0)
 =============
           17

I tack on a Guard bit, Round Bit, and Sticky Bit to the mantissas:

1.11100110110010010011100 000
1.00011000111111010000100 000

I shift the lesser value's mantissa to the right 17 times, with the LSb "sticking" once it receives a 1:

0.00000000000000001000110 001

I add the greater mantissa to the shifted lesser mantissa:

1.11100110110010010011100 000 +
0.00000000000000001000110 001
================================
1.11100110110010011100010 001

Since there was no overflow, and the guard bit is 0, I can use the summation-mantissa and greater-exponent directly (re-removing the implicit '1'):

0 00010001 11100110110010011100010

Giving a final value of:

00001000111100110110010011100010 (1.46487e-33)

But according to my verification implementation, I should be getting:

00001000111100110110010010101000 (1.46487e-33)

So very close but not exact. Is there a mistake in my algorithm?


Solution

  • There appear to be two problems in the calculation, both related to treating a subnormal number as though it were normal:

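    1. Incorrect shift calculation. A subnormal's exponent is -126 (the same as a biased exponent of 1), not -127, so the shift is 17 - 1 = 16 bits rather than 17.
    2. Incorrectly inserting a one bit before the binary point. A subnormal has no implicit leading 1, so the bit before the binary point is 0.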

    Here is the revised calculation:

    0 00010001 1.11100110110010010011100
    0 00000000 0.00011000111111010000100
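
As a sketch of that decomposition, the following Python snippet splits a 32-bit pattern into its fields and treats an all-zero exponent field as a subnormal. The helper name `decode_f32` is hypothetical, and infinities and NaNs are deliberately not handled.

```python
def decode_f32(bits: int):
    """Split a float32 bit pattern into
    (sign, effective biased exponent, 24-bit significand)."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0:
        # Subnormal (or zero): no implicit 1, and the exponent
        # behaves like biased exponent 1, i.e. 2**-126.
        return sign, 1, frac
    # Normal number: restore the implicit leading 1.
    # (exp == 255, i.e. infinity/NaN, is outside this sketch.)
    return sign, exp, frac | 0x800000


a = int("00001000111100110110010010011100", 2)
b = int("00000000000011000111111010000100", 2)

_, ea, ma = decode_f32(a)
_, eb, mb = decode_f32(b)
print(ea, format(ma, "024b"))   # 17 111100110110010010011100
print(eb, format(mb, "024b"))   # 1 000011000111111010000100
print("shift =", ea - eb)       # shift = 16
```

The first bit of each 24-bit string is the bit before the binary point: 1 for the normal operand and 0 for the subnormal one.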
    

    Tack on a Guard bit, Round Bit, and Sticky Bit to the mantissas:

    1.11100110110010010011100 000
    0.00011000111111010000100 000
    

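    Shift the smaller number right by 16 bits. The first two bits shifted out of the mantissa land in the guard and round positions, and all remaining bits are ORed into the sticky bit:

    0.00000000000000000001100 011
    

    Add the greater mantissa to the shifted lesser mantissa:

    1.11100110110010010011100 000 +
    0.00000000000000000001100 011
    ================================
    1.11100110110010010101000 011

    The guard bit is 0, so the sum rounds down, giving 00001000111100110110010010101000, which matches the expected result.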
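
The whole procedure (subnormal handling, alignment shift with a sticky bit, addition, and round-to-nearest-even) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not a reference implementation: both inputs are assumed to be positive and finite, and the function names `shift_right_sticky`, `add_f32_positive`, and `reference` are hypothetical. The `reference` helper cross-checks the result by letting Python's `struct` module round the double-precision sum to single precision.

```python
import struct


def shift_right_sticky(value: int, n: int) -> int:
    """Right-shift by n bits, OR-ing every bit shifted out into the LSB."""
    lost = value & ((1 << n) - 1)
    return (value >> n) | (1 if lost else 0)


def add_f32_positive(a_bits: int, b_bits: int) -> int:
    """Add two positive, finite float32 bit patterns (round to nearest even)."""
    def decode(bits):
        exp = (bits >> 23) & 0xFF
        frac = bits & 0x7FFFFF
        # Subnormal: exponent acts as 1 and there is no implicit 1.
        return (1, frac) if exp == 0 else (exp, frac | 0x800000)

    ea, ma = decode(a_bits)
    eb, mb = decode(b_bits)
    if ea < eb:                      # make (ea, ma) the larger operand
        ea, ma, eb, mb = eb, mb, ea, ma

    ma <<= 3                         # append guard, round, sticky bits
    mb = shift_right_sticky(mb << 3, ea - eb)

    total, exp = ma + mb, ea
    if total >> 27:                  # carry out of the 24-bit significand
        total = shift_right_sticky(total, 1)
        exp += 1

    grs, sig = total & 0b111, total >> 3
    if grs > 0b100 or (grs == 0b100 and sig & 1):
        sig += 1                     # round up (ties go to even)
        if sig >> 24:                # rounding carried out
            sig >>= 1
            exp += 1

    if exp >= 255:
        return 0x7F800000            # overflow to +infinity
    if sig < 0x800000:
        exp = 0                      # result is still subnormal
    return (exp << 23) | (sig & 0x7FFFFF)


def reference(a_bits: int, b_bits: int) -> int:
    """Cross-check: add as Python floats, then round to float32 via struct."""
    fa, = struct.unpack("<f", struct.pack("<I", a_bits))
    fb, = struct.unpack("<f", struct.pack("<I", b_bits))
    out, = struct.unpack("<I", struct.pack("<f", fa + fb))
    return out


a = int("00001000111100110110010010011100", 2)
b = int("00000000000011000111111010000100", 2)
print(format(add_f32_positive(a, b), "032b"))
# 00001000111100110110010010101000
print(add_f32_positive(a, b) == reference(a, b))   # True
```

Keeping three extra bits and folding everything below them into the sticky bit is enough to decide the rounding direction here, because only the comparison of the discarded part against one half matters.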