I have a question about the pack754()
function defined in Section 7.4 of Beej's Guide to Network Programming.
This function converts a floating point number f
into its IEEE 754 representation where bits
is the total number of bits to represent the number and expbits
is the number of bits used to represent only the exponent.
I am concerned with single-precision floating numbers only, so for this question, bits
is specified as 32
and expbits
is specified as 8
. This implies that 23
bits is used to store the significand (because one bit is sign bit).
My question is about this line of code.
significand = fnorm * ((1LL<<significandbits) + 0.5f);
What is the role of + 0.5f
in this code?
Here is a complete code that uses this function.
#include <stdio.h>
#include <stdint.h> // defines uintN_t types
#include <inttypes.h> // defines PRIx macros
uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
long double fnorm;
int shift;
long long sign, exp, significand;
unsigned significandbits = bits - expbits - 1; // -1 for sign bit
if (f == 0.0) return 0; // get this special case out of the way
// check sign and begin normalization
if (f < 0) { sign = 1; fnorm = -f; }
else { sign = 0; fnorm = f; }
// get the normalized form of f and track the exponent
shift = 0;
while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
fnorm = fnorm - 1.0;
// calculate the binary form (non-float) of the significand data
significand = fnorm * ((1LL<<significandbits) + 0.5f);
// get the biased exponent
exp = shift + ((1<<(expbits-1)) - 1); // shift + bias
// return the final answer
return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
int main(void)
{
float f = 3.1415926;
uint32_t fi;
printf("float f: %.7f\n", f);
fi = pack754(f, 32, 8);
printf("float encoded: 0x%08" PRIx32 "\n", fi);
return 0;
}
What purpose does + 0.5f
serve in this code?
The code is an incorrect attempt at rounding.
long double fnorm;
long long significand;
unsigned significandbits
...
significand = fnorm * ((1LL<<significandbits) + 0.5f); // bad code
The first clue of incorrectness is the f
of 0.5f
, which indicates float
, is a nonsensical introduction of specifying float
in a routine with long double f
and fnorm
. float
math has no application in the function.
Yet adding 0.5f
does not mean that the code is limited to float
math in (1LL<<significandbits) + 0.5f
. See FLT_EVAL_METHOD
which may allow higher precision intermediate results and have fooled the code author in testing.
A rounding attempt does make sense as the argument is long double
and the target representations are narrower. Adding 0.5
is a common approach - but it is not done right here. IMO, the lack of the author commenting here concerning 0.5f
hinted that the intent was "obvious" - not subtle, albeit incorrect.
As commented, moving the 0.5
is closer to being correct for rounding, but may mis-lead some into thinking the addition is done with float
math, (it is long double
math adding a long double
product to float
causes the 0.5f
to be promoted to long double
first).
// closer to rounding but may mislead
significand = fnorm * (1LL<<significandbits) + 0.5f;
// better
significand = fnorm * (1LL<<significandbits) + 0.5L; // or 0.5l or simply 0.5
To round, without calling the preferred <math.h>
rounds routines like rintl(), roundl(), nearbyintl(), llrintl()
, adding the explicit type 0.5 is still a weak attempt at rounding. It is weak because it rounds incorrectly with many cases. The +0.5 trick relies on that sum being exact.
Consider
long double product = fnorm * (1LL<<significandbits);
long long significand = product + 0.5; // double rounding?
product + 0.5
itself may go through a rounding before truncation/assignment to long long
- in effect double rounding.
Best to use the right tool in the C shed of standard library functions.
significand = llrintl(fnorm * (1ULL<<significandbits));
A corner case remains with this rounding is where significand
is now one too great and significand , exp
needs adjustment. As well identified by @Nayuki, code has other short-comings too. Also, it fails on -0.0
.