Search code examples
floating-pointieee-754subtraction

how to implement IEEE 754 floating point subtraction in hardware?


I searched the google and with no luck to find the answer of what I'm looking for.

Imagine these 2 numbers in base 2:

A = 1.0001 * e-4
B = 1.001 * e-6

So to subtract these 2 numbers we need to shift the 2end one 2 bits right, to have the same exponents. So now we have:

A = 1.0001 * e-4
B = 0.001001 * e-4

Now our exponents are the same and we should do subtraction for significands, which means:

  1.000100
- 0.001001
----------
  0.111011

Then we do the normalization and rounding process on the result. As a human we know how to deal with this subtraction, but what about HW? Does it use any special algorithm for making negative the B number (algorithm like 2's complement just like what we do for Integers)? This question is also valid when we want to do A+B but the B is a negative number.


Solution

  • how to implement IEEE 754 floating point subtraction in hardware?

    To be clear: IEEE 754 does not dictate how HW must perform a subtraction, only that given 2 inputs and rounding mode what the results must be. The HW is a black box in the middle where a miracle occurs.


    Sample HW subtract algorithm:

    Assume N=8 bit significand, same sign for a b (else use addition), and |a| >= |b| > 0 (else swap operands). Bit x is 0 or 1.

      1.xxxxxxx e AA
    - 1.xxxxxxx e BB
    

    Use N+2 wide register. Find shift needed AA - BB. The shift will leave some of the b bits in the N+2 register, and some to the "right".

      1.xxxxxxx_00     e AA
    - 0.00001xx_xx xxx e AA
    

    From the xxx shifted out, set a "borrow bit c" to are any of them 1?

                 c <-- Initial borrow bit                    
      1.xxxxxxx_00 
    - 0.00001xx_xx
    

    Now perform the subtraction in the usual fashion.

    To ease the function explanation, consider 2 cases: no initial shift/shift, even though HW would use a single common path.

      // Result with no shift.  Initial borrow bit 'c' is then zero.
      0.1111111_00 Max value
      0.0000000_00 Min-value  (a == b)
    

    The result is shifted left and the exponent decremented until the lead bit is 1. The subtraction is exact. A zero result is handled specially (not shown).


    In case 2, with shifting, 'c' is 0 or 1.

      // Result with shift
      1.1111110_11 Max value.
      0.1111111_1x Lowest-values. 
    

    If the lead bit is 0, the result is shifted left and exponent decremented. It is the lowest-values sub-cases that oblige a N+2 instead of a N+1 register to get the rounding (below) correct.


    Now rounding occurs. First the s and c bit are or'ed (is either one?) to form the new C. Various rounding modes like up, down, truncate, toward infinity and the popular "round to nearest with ties to even" can be deduced solely from the sign, o, r and C. The result is exact when r and C are zero.

              o rs c
      1.xxxxxxx_xx c
      -->
      1.xxxxxxx_x  C
    

    Now add the round bit R.

      1.xxxxxxx
      0.000000R
    

    This sum may result in 10.0000000. In which case the result is shifted right and the exponent incremented.