Search code examples
cfloating-pointieee-754

Why do we multiply normalized fraction with 0.5 to get the significand in IEEE 754 representation?


I have a question about the pack754() function defined in Section 7.4 of Beej's Guide to Network Programming.

This function converts a floating point number f into its IEEE 754 representation where bits is the total number of bits to represent the number and expbits is the number of bits used to represent only the exponent.

I am concerned with single-precision floating numbers only, so for this question, bits is specified as 32 and expbits is specified as 8. This implies that 23 bits is used to store the significand (because one bit is sign bit).

My question is about this line of code.

    significand = fnorm * ((1LL<<significandbits) + 0.5f);

What is the role of + 0.5f in this code?

Here is a complete code that uses this function.

#include <stdio.h>
#include <stdint.h> // defines uintN_t types
#include <inttypes.h> // defines PRIx macros

uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
    long double fnorm;
    int shift;
    long long sign, exp, significand;
    unsigned significandbits = bits - expbits - 1; // -1 for sign bit

    if (f == 0.0) return 0; // get this special case out of the way

    // check sign and begin normalization
    if (f < 0) { sign = 1; fnorm = -f; }
    else { sign = 0; fnorm = f; }

    // get the normalized form of f and track the exponent
    shift = 0;
    while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
    while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
    fnorm = fnorm - 1.0;

    // calculate the binary form (non-float) of the significand data
    significand = fnorm * ((1LL<<significandbits) + 0.5f);

    // get the biased exponent
    exp = shift + ((1<<(expbits-1)) - 1); // shift + bias

    // return the final answer
    return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}

int main(void)
{
    float f = 3.1415926;
    uint32_t fi;

    printf("float f: %.7f\n", f);

    fi = pack754(f, 32, 8);
    printf("float encoded: 0x%08" PRIx32 "\n", fi);

    return 0;
}

What purpose does + 0.5f serve in this code?


Solution

  • The code is an incorrect attempt at rounding.

    long double fnorm;
    long long significand;
    unsigned significandbits
    ...
    significand = fnorm * ((1LL<<significandbits) + 0.5f);  // bad code
    

    The first clue of incorrectness is the f of 0.5f, which indicates float, is a nonsensical introduction of specifying float in a routine with long double f and fnorm. float math has no application in the function.

    Yet adding 0.5f does not mean that the code is limited to float math in (1LL<<significandbits) + 0.5f. See FLT_EVAL_METHOD which may allow higher precision intermediate results and have fooled the code author in testing.

    A rounding attempt does make sense as the argument is long double and the target representations are narrower. Adding 0.5 is a common approach - but it is not done right here. IMO, the lack of the author commenting here concerning 0.5f hinted that the intent was "obvious" - not subtle, albeit incorrect.

    As commented, moving the 0.5 is closer to being correct for rounding, but may mis-lead some into thinking the addition is done with float math, (it is long double math adding a long doubleproduct to float causes the 0.5f to be promoted to long double first).

    // closer to rounding but may mislead
    significand = fnorm * (1LL<<significandbits) + 0.5f;
    
    // better
    significand = fnorm * (1LL<<significandbits) + 0.5L; // or 0.5l or simply 0.5
    

    To round, without calling the preferred <math.h> rounds routines like rintl(), roundl(), nearbyintl(), llrintl(), adding the explicit type 0.5 is still a weak attempt at rounding. It is weak because it rounds incorrectly with many cases. The +0.5 trick relies on that sum being exact.

    Consider

    long double product = fnorm * (1LL<<significandbits);
    long long significand = product + 0.5;  // double rounding?
    

    product + 0.5 itself may go through a rounding before truncation/assignment to long long - in effect double rounding.

    Best to use the right tool in the C shed of standard library functions.

    significand = llrintl(fnorm * (1ULL<<significandbits));
    

    A corner case remains with this rounding is where significand is now one too great and significand , exp needs adjustment. As well identified by @Nayuki, code has other short-comings too. Also, it fails on -0.0.