Search code examples
cfloating-pointmultiplicationadditionieee-754

How to correctly implement multiply for floating point numbers (software FP)


My program is about a method which is given floats and in this method I want to multiply or add those floats. But not multiply like a * b, I want to break those floats down to their structure like the bit for the sign, the 8 bit for the exponent and the rest of the bits as the mantissa.

I want to implement / emulate software floating-point add and multiply (to learn more about what FP hardware has to do).


In the head of the program there are the breakdowns:

    #define  SIGN(x) (x>>31);
    #define  MANT(x) (x&0x7FFFFF);
    #define  EXPO(x) ((x>>23)&0xFF);

    #define SPLIT(x, s, m, e) do {  \
    s = SIGN(x);                    \
    m = MANT(x);                    \
    e = EXPO(x);                    \
    if ( e != 0x00 && e != 0xFF ) { \
      m |= 0x800000;                \
    }                               \
    } while ( 0 )  

    #define BUILD(x, s, m, e) do {               \
        x = (s << 31) | (e<<23) | (m&0x7FFFFF);  \
    } while ( 0 )

The main looks as follows:

    float f = 2.3; 
    float g = 1.8; 
    float h = foo(&f, &g);

And the method for the calculation looks like:

   float foo(float *a, float *b)  {
   uint32_t ia = *(unsigned int *)a;
   uint32_t ib = *(unsigned int *)b;
   uint32_t result = 0;
   uint32_t signa, signb, signr; 
   uint32_t manta, mantb, mantr; 
   uint32_t expoa, expob, expor; 
   SPLIT(ia, signa, manta, expoa); 
   SPLIT(ib, signb, mantb, expob); 

I already tried the multiply by adding the exponents and multiply their mantissas as follow:

   expor = (expoa -127) + (expob -127) + 127;
   mantr = (manta) * (mantb);
   signr = signa ^ signb;

The return and rebuild of the new float:

   BUILD(result, signr, mantr, expor);
   return *(float *)&result;

The problem is now, that the result is wrong. the mantr even takes a very low negative Number (in case if foo gets 1.5 and 2.4 mantr takes -838860800 and the result is 2.0000000).


Solution

  • You can't just take truncate the result of the mantissa multiply, you need to take the top 24 bits (after using the low half for rounding) and renormalize (adjust the exponent).

    Floating point operations keep the top significand bits. The most significant part of the integer product is the high bits; the low bits are further places after the decimal. (Terminology: it's a "binary point", not "decimal point", because binary floats use radix 2 (binary), not 10 (decimal).)


    For normalized inputs, the implicit leading 1 in the input significands means the 32x32 => 64-bit uint64_t product that you use to implement 24 x 24 => 48-bit mantissa multiplication will have its high bit in one of 2 possible locations, so you don't need a bit-scan to find it. A compare or single-bit-test will do.

    For subnormal inputs, that's not guaranteed so you need to check where the MSB is, e.g. with GNU C __builtin_clzll. (There are many special cases to handle for one or both inputs being subnormal, and/or the output being subnormal.)

    See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for more about the IEEE-754 binary32 format, including the implied leading 1 of the significand.

    And see @njuffa's answer for an actual tested + working implementation that does 64-bit operations as two 32-bit halves for some reason, instead of letting C do that efficiently.


    Also, return *(float *)&result; violates strict aliasing. It's only safe on MSVC. Use a union or memcpy for type punning in C99 / C11.