I'm trying to solidify my understanding of IEEE-754 Floating point, but couldn't find answers to some questions below, and was wondering if anyone knew enough on the topic to provide insight. Please let me know if there is something I'm misunderstanding about FP as well!
I understand subnormals implement gradual underflow (bridging the gap to 0 with values spaced at the smallest normalized spacing, at the cost of precision) in scientific computing, but why aren't the small numbers simply scaled up before use in calculations instead? (Unless our sequence of calculations is expected to use the entire FP dynamic range.)
Are "supernormals" not included since there is no way to "bridge the gap" to infinity?
When the exp field in FP32 is all ones ('11111111'), we only represent 3 values (+inf/-inf/NaN), so isn't the rest of the "address space" (sorry, don't know the terminology) being wasted? Why not use it for an extra power of 2?
Lastly, for FP32, why is the bias 127, rather than, say, 150, which would allow small numbers to be represented better at the cost of larger numbers? Was 127 chosen arbitrarily since it is the midpoint of the 256-size range?
… why aren't the small numbers simply scaled up before use in calculations instead? (Unless our sequence of calculations is expected to use the entire FP dynamic range.)
Are you positing that an alternative to a floating-point implementation supporting subnormals is that applications using floating-point would adjust the scales of their numbers to avoid creating subnormal results?
That adds a burden to applications. The point of providing a service is to provide it well and relieve its users of that work.
Are "supernormals" not included since there is no way to "bridge the gap" to infinity?
Subnormals solve some problems, such as x-y == 0 being true while x == y is false, which can lead to bugs. For example, software that tested x == y and found it to be false would expect to be able to calculate c / (x-y) without getting a division-by-zero exception, but that would not be true without subnormal support.
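Here is a minimal C sketch of that case (assuming the compiler preserves IEEE-754 semantics and does not flush subnormals to zero, e.g. no -ffast-math or FTZ mode):

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* Two distinct values just above the smallest normal float (FLT_MIN = 2^-126). */
    float x = FLT_MIN * 1.5f;   /* exactly representable: 1.5 * 2^-126 */
    float y = FLT_MIN;          /* 1.0 * 2^-126 */
    float d = x - y;            /* exact difference 2^-127, a subnormal */

    printf("x == y     : %d\n", x == y);                   /* 0 (false)        */
    printf("x - y      : %g\n", d);                        /* nonzero          */
    printf("subnormal? : %d\n", d > 0.0f && d < FLT_MIN);  /* 1                */
    printf("1.0f / d   : %g\n", 1.0f / d);                 /* finite, ~1.7e38  */
    return 0;
}
```

With subnormals, the test x == y being false guarantees x - y is nonzero, so the division is safe; with flush-to-zero, d would be 0 and the division would trap or produce infinity.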
What sort of issues would you expect supernormals to solve, and what sort of implementation do you propose for them?
When the exp field in FP32 is all ones ('11111111'), we only represent 3 values (+inf/-inf/NaN), so isn't the rest of the "address space" (sorry, don't know the terminology) being wasted? Why not use it for an extra power of 2?
When the floating-point datum is a NaN, the significand field may be used for additional information. For example, it could store the address, or part of the address, of the instruction where the NaN was created. Or NaNs used to initialize data could be different from NaNs created by floating-point operations during program execution, so the final output of a program could reveal whether a NaN appears in the output because some program error resulted in using data that was never initialized with a proper value or because some calculation created a NaN.
The significand field is not used for this as widely as may have been hoped when the IEEE-754 standard was created, but there are systems that do such things.
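As an illustration of how those encodings break down (a sketch, not any particular system's convention; make_f32 is a hypothetical helper):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Build a float from a sign bit, 8-bit exponent field, and 23-bit significand field. */
static float make_f32(uint32_t sign, uint32_t exp, uint32_t frac) {
    uint32_t bits = (sign << 31) | ((exp & 0xFF) << 23) | (frac & 0x7FFFFF);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    float inf  = make_f32(0, 0xFF, 0);        /* all-ones exp, zero frac -> +inf        */
    float qnan = make_f32(0, 0xFF, 0x400123); /* all-ones exp, nonzero frac -> quiet NaN,
                                                 with 0x123 carried as a payload        */
    printf("inf  = %g (isinf = %d)\n", inf,  isinf(inf));
    printf("qnan = %g (isnan = %d)\n", qnan, isnan(qnan));

    /* All 2^23 - 1 nonzero significand patterns under exp == 0xFF are NaNs, so the
       encodings that look "wasted" are available for payloads and diagnostics. */
    return 0;
}
```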
Lastly, for FP32, why is the bias 127, rather than, say, 150, which would allow small numbers to be represented better at the cost of larger numbers? Was 127 chosen arbitrarily since it is the midpoint of the 256-size range?
We want something near the midpoint because programs that use x tend to also use 1/x. With a bias of 127, the normal exponents run from -126 to +127, so the reciprocal of essentially any normal number is still representable: the reciprocal of the smallest normal, 2^-126, is 2^126, comfortably below the maximum, and reciprocals of the largest values land in the subnormal range rather than vanishing to zero. A lopsided bias such as 150 would make 1/x overflow for many small but representable values of x.
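A quick way to see that near-symmetry in C (a sketch; the values in the comments are approximate):

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* With bias 127, normal exponents run from -126 to +127, nearly symmetric
       about 0, so reciprocals of normal numbers neither overflow nor vanish. */
    printf("FLT_MIN     = %g\n", FLT_MIN);        /* ~1.18e-38, i.e. 2^-126          */
    printf("FLT_MAX     = %g\n", FLT_MAX);        /* ~3.40e+38, just under 2^128     */
    printf("1 / FLT_MIN = %g\n", 1.0f / FLT_MIN); /* ~8.5e+37: finite                */
    printf("1 / FLT_MAX = %g\n", 1.0f / FLT_MAX); /* ~2.9e-39: subnormal but nonzero */
    return 0;
}
```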