I am designing some double-double and quad-double arithmetic libraries for C++11. Part of the code is based off of the QD and DDFUN library from https://www.davidhbailey.com/dhbsoftware/.
struct Float64x2 {
double hi;
double lo;
};
struct Float64x4 {
double val[4];
};
Dekker-float types have some properties that regular floating point types do not. For example, being able to represent FLT_MAX + FLT_DENORM_MIN
exactly, which would normally require ~2048bits of precision to represent (2^1023 + 2^-1074
). Float80x2 requiring ~32768bits due to the larger 15bit exponent (image).
I used some of the values from QD for std::numeric_limits
as a starting point. Although I am not sure if they are the best values to use. How should std::numeric_limits::epsilon()
be defined for Dekker-floats?
epsilon()
is the difference between 1.0 and the next representable number. For double
, epsilon is 0x1.0p-52
, but I am not sure what it should be for Dekker-float types.
epsilon()
literally, it should be either numeric_limits<double>::min()
or numeric_limits<double>::denorm_min()
for both Float64x2
and Float64x4
. Since they can both represent 1.0 + FLT_MIN
and 1.0 + FLT_DENORM_MIN
exactly.dd_real
and qd_real
from QD define it as 0x1.0p-104
and 0x1.0p-208
respectively; which is probably a more useful definition of epsilon()
. This definition can be used to determine when to stop adding terms to an approximation (i.e. using a Taylor-series to approximate sin(x)
).ddfune.f90 dderfr
), which is ldexp(1.0, -numeric_limits<T>::digits)
. This gives a slightly smaller epsilon of 0x1.0p-53
for double
, and a value of 0x1.0p-106
and 0x1.0p-212
for Float64x2
and Float64x4
respectively.Which definition for epsilon()
should I provide through std::numeric_limits
for Float64x2
and Float64x4
?
Additionally, what should std::numeric_limits::<Float64x4>epsilon()
return? Should it return a double
(which can represent the value of epsilon exactly), or should it return the same type (Float64x4
)?
Float64x2
and Float64x4
are not floating-point formats (discussed below) and should not be treated as such.
dd_real
andqd_real
from QD define it as0x1.0p-104
and0x1.0p-208
respectively; which is probably a more useful definition ofepsilon()
. This definition can be used to determine when to stop adding terms to an approximation (i.e. using a Taylor-series to approximatesin(x)
).
Being useful does not constitute a reason to give a number the same name used for a feature of a different kind of format. To inform clients about properties of your formats, simply define new names.
- Taking the definition of
epsilon()
literally, it should be eithernumeric_limits<double>::min()
ornumeric_limits<double>::denorm_min()
for bothFloat64x2
andFloat64x4
. Since they can both represent1.0 + FLT_MIN
and1.0 + FLT_DENORM_MIN
exactly.…
Which definition for
epsilon()
should I provide throughstd::numeric_limits
forFloat64x2
andFloat64x4
?
Taking the definition of epsilon()
literally, you should not define it at all. C++ 2020 draft N4849 17.3.5.1 [numeric.limits.members] specifies it as:
25 Machine epsilon: the difference between 1 and the least value greater than 1 that is representable.
26 Meaningful for all floating-point types.
This informs us that epsilon()
is not meaningful for Float64x2
and Float64x4
, since they are not floating-point types. They are numeric types, in that they can represent numbers, but they do not conform to a floating-point model. A floating-point format represents a number x with an integer M (or, equivalently, a fixed-point value) and an exponent e with:
x = M•βe−p+1
with |M| < βp and specific values for integers β and p and bounds on e that are fixed characteristics of the format. This is from Handbook of Floating-Point Arithmetic by Muller et al. The C++ standard does not state the floating-point model explicitly but inherits it from the C standard, which states an equivalent model (using a fixed-point M instead of integer). The IEEE 754 standard for floating-point arithmetic uses an equivalent model.
Float64x2
and Float64x4
do not conform to this model. There are no integers β and p and bounds on e such that the set of numbers representable in the model is the set of numbers representable in Float64x2
, and similarly for Float64x4
. “Floating-point” is not a synonym for real numbers or fractions or non-integer numeric formats in general. It is a term for a specific kind of mathematical model.
Representing a number by the unevaluated sum of two floating-point numbers is a different model and is not a floating-point format as it is commonly understood by practitioners of floating-point arithmetic or as stated in the C standard and inherited by C++.
Since Float64x2
and Float64x4
are not floating-point types, they are not arithmetic types as defined in the C++ standard. 6.8.1 [basic.fundamental] 12 says “Integral and floating-point types are collectively called arithmetic types.” 17.3.5 [numeric.limits] 4 says “Specializations shall be provided for each arithmetic type,…” This does not explicitly say that specializations shall not be provided for non-arithmetic types, but observe that paragraph 6 says “Non-arithmetic standard types, such as complex<T>
(26.4.2), shall not have specializations.” That is about standard types, but it is consistent with our observations: Defining values for numeric_limits
members that were not designed to describe types other than the integer and floating-point types is problematic.
Whatever you do here goes beyond the C++ standard. You could define numeric_limits
members with new names to describe new properties or you could define some entirely new template to describe properties. Either of these could be designed reasonably without interfering with standard uses of the numeric_limits
template. What I do not recommend is using the existing names where they are inapplicable, as that can cause inconsistencies with standard uses of the template.