
How to normalize the sum of two IEEE754 single precision numbers?


I am designing a floating point unit in SystemVerilog that takes two 32-bit inputs that are in IEEE754 format, adds them together, and outputs the result in the same 32-bit IEEE754 format.

The step I am stuck on is normalization: moving the "leftmost" 1 to the correct bit, which should be bit 23 (counting from bit 0).

I don't know how to identify what the leftmost 1 bit is so I can shift the mantissa and increment/decrement the exponent bits appropriately.

Addition:

  • Separate the bits into sign, exponent, and mantissa
  • Prepend a '1' to the mantissas
  • Compare exponents and add the difference to the smaller exponent
  • Shift the mantissa of the number with the smaller exponent to the right by that difference to line up the binary points
  • Perform binary addition
  • Normalize the result if necessary
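
As a rough software model of the first four steps (not SystemVerilog, just a bit-level sketch that could be ported), the unpack and align stages might look like this. The function names `unpack` and `align` are made up for illustration, and the sketch assumes both inputs are normal numbers (exponent field neither 0 nor 255). Bits shifted out during alignment are simply dropped here.

```python
def unpack(bits):
    """Split a 32-bit IEEE 754 pattern into (sign, biased exponent, 24-bit significand)."""
    sign = (bits >> 31) & 0x1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    # Assumption: the input is a normal number, so the hidden leading 1 goes at bit 23.
    sig = frac | (1 << 23)
    return sign, exp, sig


def align(exp_a, sig_a, exp_b, sig_b):
    """Shift the significand with the smaller exponent right so both share the larger exponent.

    Bits shifted out are discarded (truncation); no rounding is modelled here.
    """
    if exp_a < exp_b:
        return exp_b, sig_a >> (exp_b - exp_a), sig_b
    return exp_a, sig_a, sig_b >> (exp_a - exp_b)


sign_a, exp_a, sig_a = unpack(0x3FC00000)  # 1.5
sign_b, exp_b, sig_b = unpack(0x40500000)  # 3.25
print(align(exp_a, sig_a, exp_b, sig_b))
```

With these two inputs the aligned significands are 0x600000 and 0xD00000 at exponent 128, which is the bit layout written out in the example further down.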

I have every step except the normalizing part correct.

How can I tell if my result needs to be normalized if all I have are bits?

A result is not normalized if it does not have the form 1.fraction. E.g., 10.10101 * 2^1 should be normalized to 1.010101 * 2^2, and .1001 * 2^2 should be normalized to 1.001 * 2^1.

How can I keep track of where the binary point is after adding two numbers?

For example: Adding input a, 0x3fc00000 (1.5), and b, 0x40500000 (3.25):

a = 0 | 0111 1111 | (1)100 0000 0000 0000 0000 0000
b = 0 | 1000 0000 | (1)101 0000 0000 0000 0000 0000

The exponent of a is less than b by a difference of 1, so:

a = 0 | 1000 0000 | 0(1)10 0000 0000 0000 0000 0000
b = 0 | 1000 0000 | (1)101 0000 0000 0000 0000 0000

Adding the mantissas gives:

1 0011 0000 0000 0000 0000 0000

The leftmost 1 is bit 24 as opposed to bit 23, so we shift the mantissa to the right by 1 and increment the exponent to normalize the result. Then we remove the leftmost 1 because it is implied in IEEE754 format:

0 | 1000 0001 | 001 1000 0000 0000 0000 0000 (4.75)

This is our final output, which is correct.
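
The worked example can be reproduced with a small bit-level model. This is only a sketch under assumptions: both inputs are normal and have the same sign, the exponent does not overflow, and bits shifted out are truncated rather than rounded. The names `add_same_sign` and `bits_to_float` are illustrative.

```python
import struct


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a Python float (for checking results)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def add_same_sign(a_bits, b_bits):
    """Add two normal binary32 values that share a sign (truncating, no overflow check)."""
    sign = a_bits >> 31
    exp_a, sig_a = (a_bits >> 23) & 0xFF, (a_bits & 0x7FFFFF) | 0x800000
    exp_b, sig_b = (b_bits >> 23) & 0xFF, (b_bits & 0x7FFFFF) | 0x800000
    # Make 'a' the operand with the larger exponent, then align 'b' to it.
    if exp_a < exp_b:
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    sig_b >>= exp_a - exp_b
    total = sig_a + sig_b  # at most 25 bits wide
    exp = exp_a
    if total >> 24:        # carry out of bit 23 into bit 24
        total >>= 1        # shift right by one ...
        exp += 1           # ... and increment the exponent
    # Drop the hidden bit (bit 23) when packing the result.
    return (sign << 31) | (exp << 23) | (total & 0x7FFFFF)


result = add_same_sign(0x3FC00000, 0x40500000)
print(hex(result), bits_to_float(result))
```

For 1.5 + 3.25 this yields the pattern 0x40980000, which decodes to 4.75.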

Given this example, I thought I just had to check for the following cases:

  • If bit 24 of the mantissa is equal to 1, shift mantissa right and increment exponent
  • Else check bit 23 is 1, if true no normalization needed
  • Else check bit 22 is 1, then shift mantissa left and decrement exponent

However, I'm only finding this to be true for some cases.

What am I missing?

In my implementation I made a 26-bit value to hold the sum of the two mantissas, but I don't know if that is correct. Bit 25 is the sign of the mantissa, which I don't need, and bits 24 and 23 are the hidden bits, or bits that won't be included in the final output.

For example: 0x449ebbc8 (1269.868163) + 0xc60eb709 (-9133.758561) gives me the following mantissa:

11 0111 1010 1101 1111 1001 0000

This is 26 bits (25:0).

If I followed the previous case, that would mean the leftmost 1 bit, excluding the sign bit, would be bit 24, meaning I would shift the mantissa right and increment the exponent. However, the correct answer is the opposite! The 'true' leftmost 1 bit is actually bit 22, meaning I should shift left and decrement instead, giving me the final output of:

1 | 10001011 | 111 0101 1011 1111 0010 0000 (-7863.8906) which is correct.

Similarly, adding 0x45c59cbd and 0xc473d9dc gives a mantissa of

01 1010 0111 0010 0001 1000 0010 but the "leftmost 1" bit is not the one at bit 24, but bit 23, so no normalization is needed.

Why is it that for the first case I needed to worry about bit 24 but not the other two cases?

Is it because I'm adding opposite signs for the other cases? Overflow problem?


Solution

  • Consider adding two positive normal numbers in the IEEE-754 basic 32-bit binary format. When their significands¹ are completed by prefixing the leading bit, shifted to align the exponents, and added, the leading bit is either in the same position (because no carry occurred) or one to the left (because a carry occurred). To normalize this, simply shift one bit right if a carry occurred.

    (If both numbers are subnormal, the leading bit may be further to the right. However, no normalization will be done, as the result either carried into the position that makes it normal [so no normalization is needed] or did not carry into that position [so the result is still subnormal and cannot be normalized].)

    If both numbers are negative, the same situation holds. The significands may be treated as absolute values, ignoring the sign bits.

    If the numbers have opposite signs, there are complications. The question describes prefixing a sign bit to the significand. This would not appear to lead to a correct result. For example, consider adding +1.125 and −1.125. The four-bit significand of each number is 1001. Prefixing the sign bits gives us 01001 and 11001, respectively. Then adding those gives 1 00010 (the new leftmost digit comes from a carry out of the previous leftmost position). Regardless of how we treat the leading bits, the low bits are wrong: 0010 is not correct. Since +1.125 + −1.125 = 0, the result ought to be 0000 with some sign. So merely prefixing the sign bit to a significand is not a correct procedure.
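
    The arithmetic in that counterexample is easy to check directly. The snippet below is only an illustration of the sign-prefix scheme failing; the variable names are arbitrary.

```python
sig = 0b1001                 # four-bit significand of 1.125 (1.001 in binary)
pos = (0 << 4) | sig         # sign bit 0 prefixed: 01001
neg = (1 << 4) | sig         # sign bit 1 prefixed: 11001

total = pos + neg            # plain binary addition of the two patterns
low_bits = total & 0b1111    # the four significand bits of the "result"

print(format(total, "06b"))     # 100010
print(format(low_bits, "04b"))  # 0010, but a correct sum would be 0000
```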

    Every description of implementing floating-point addition I recall specifies using subtraction instead of addition when the signs are opposed. In this case, one subtracts the smaller (or equal) number from the larger (or equal) number and then must shift left some number (possibly zero) of bits.

    In this model, determining how to normalize the number becomes simpler:

    • When adding like-sign numbers, normalization requires shifting right zero or one bits, according to whether there was a carry out from the high position. (Note that exponent overflow may occur.)
    • When subtracting opposite-sign numbers, normalization requires shifting left until the leading one bit is in the proper position or the minimum exponent is reached.
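
    A minimal software sketch of the opposite-sign case follows, assuming both inputs are normal numbers and that bits shifted out during alignment are truncated (no rounding). The function name `add_opposite_sign` is hypothetical. Python's `int.bit_length()` stands in for the leading-one detector; in hardware that role would typically be played by a priority encoder or similar logic.

```python
def add_opposite_sign(a_bits, b_bits):
    """Add two normal binary32 values of opposite sign by subtracting magnitudes."""
    mag_a, mag_b = a_bits & 0x7FFFFFFF, b_bits & 0x7FFFFFFF
    # The operand with the larger magnitude supplies the sign and starting exponent.
    big, small = (a_bits, b_bits) if mag_a >= mag_b else (b_bits, a_bits)
    sign = big >> 31
    exp = (big >> 23) & 0xFF
    sig_big = (big & 0x7FFFFF) | 0x800000
    # Align the smaller operand (shifted-out bits are dropped).
    sig_small = ((small & 0x7FFFFF) | 0x800000) >> (exp - ((small >> 23) & 0xFF))
    diff = sig_big - sig_small  # never negative because of the ordering above
    if diff == 0:
        return 0                # exact cancellation; +0 chosen here
    # Leading-one detection: index of the highest set bit is bit_length() - 1.
    # Shift it up to bit 23, but stop at the minimum normal exponent (field value 1).
    shift = min(23 - (diff.bit_length() - 1), exp - 1)
    diff <<= shift
    exp -= shift
    if diff < 0x800000:
        exp = 0                 # leading 1 never reached bit 23: subnormal encoding
    return (sign << 31) | (exp << 23) | (diff & 0x7FFFFF)


print(hex(add_opposite_sign(0x449EBBC8, 0xC60EB709)))
```

    For the question's first mixed-sign example the magnitude difference is 0x7ADF90, whose leading 1 is at bit 22, so the sketch shifts left by one and decrements the exponent, producing 0xC5F5BF20 (sign 1, exponent 10001011), the same bits the question lists as correct.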

    I expect it is possible to implement the mixed-sign case using addition and two’s complement arithmetic. In this case, one should not merely prefix the sign bit to the significand but should form the two’s complement of the significand by inverting each bit and then adding one. Once the sum is found, if it is negative, it could be two’s-complemented again and then normalized. However, you are then adding more additions, with their carry chain dependencies, to the implementation.
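
    A sketch of that two's complement alternative, operating on already-aligned significands, might look like the following. The 26-bit width (24-bit significand, one carry bit, one sign bit) and the names `to_twos`, `from_twos`, and `signed_sum` are assumptions made for illustration, not something the question's design is known to use.

```python
WIDTH = 26                  # assumed: 24-bit significand + 1 carry bit + 1 sign bit
MASK = (1 << WIDTH) - 1


def to_twos(sign, sig):
    """Encode a sign/magnitude significand as a WIDTH-bit two's complement value."""
    return (-sig if sign else sig) & MASK   # negation = invert every bit, then add one


def from_twos(value):
    """Decode a WIDTH-bit two's complement value back into (sign, magnitude)."""
    if value >> (WIDTH - 1):                # top bit set: negative
        return 1, (-value) & MASK           # two's-complement again to get the magnitude
    return 0, value


def signed_sum(sign_a, sig_a, sign_b, sig_b):
    """Add two aligned significands of any signs; returns (sign, magnitude)."""
    return from_twos((to_twos(sign_a, sig_a) + to_twos(sign_b, sig_b)) & MASK)


# Aligned significands from the question's first mixed-sign example.
print(signed_sum(0, 0x13D779, 1, 0x8EB709))
```

    The returned magnitude still has to be normalized exactly as in the subtraction model: shift right by one if bit 24 is set, otherwise shift left until the leading 1 reaches bit 23.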

    Note that you must also account for rounding the result, since some bits may be lost during the shift to align exponents before adding and during the shift to normalize the result of adding like-sign numbers.
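
    As one illustration of what rounding involves, here is a sketch of a right shift that rounds to nearest with ties to even (the default IEEE-754 rounding mode) instead of truncating. The helper name `shift_right_rne` is made up. Note the caveat: in a real adder you would normally keep a few extra low-order bits through the addition and round once at the end, rather than rounding each shift separately, and a round-up can itself carry out and require one more normalization step.

```python
def shift_right_rne(value, amount):
    """Shift a non-negative integer right, rounding to nearest with ties to even."""
    if amount == 0:
        return value
    kept = value >> amount
    dropped = value & ((1 << amount) - 1)   # the bits that fall off the right end
    half = 1 << (amount - 1)                # exactly one half of the last kept bit
    if dropped > half or (dropped == half and (kept & 1)):
        kept += 1                           # round up (ties go to the even neighbour)
    return kept


print(bin(shift_right_rne(0b10011, 1)))  # tie, kept value is odd, so it rounds up
print(bin(shift_right_rne(0b10001, 1)))  # tie, kept value is even, so it stays
```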


    ¹ “Significand” is the preferred term for the fraction portion of a floating-point number. “Mantissa” is a historic term for the fraction portion of a logarithm. Significands are linear (doubling a significand doubles the represented value) while mantissas are logarithmic (doubling a mantissa squares the portion of the value it represents).