In this answer to a question about normalization of quaternions, the author provides some code which calculates the inverse square root, using 2.0 / (1.0 + qmagsq)
as an approximation for 1.0 / std::sqrt(qmagsq)
for values quite near 1:
double qmagsq = quat.square_magnitude();
if (std::abs(1.0 - qmagsq) < 2.107342e-08) {
quat.scale (2.0 / (1.0 + qmagsq));
} else {
quat.scale (1.0 / std::sqrt(qmagsq));
}
The author then provides the following explanation:
For values of
qmagsq
between 0 and 2, the error in this approximation is less than(1-qmagsq)^2 / 8
. The magic number 2.107342e-08 represents where this error is more that half an ULP for IEEE doubles.
Presumably, this is because sqrt(8 * 2^-(1+52) / 2)
is approximately 2.10734243e-8, where 2^-(1+52) / 2
is half the precision of double
.
How would one derive (1-qmagsq)^2 / 8
as an upper bound of the error of this approximation for values of qmagsq
between 0 and 2?
Edit:
It has been pointed out that the error bound provided by the author does not actually hold for values of qmagsq
between 0 and 1. As a result, the question becomes a bit more open:
How would one derive an error bound for this approximation that can be used to determine the range that the approximation has an error of less than half an ULP for IEEE doubles?
Computers have gotten fast enough that it is feasible to test particular assertions for single-argument functions exhaustively over the entire input domain for single precision, and for reasonably small intervals for double precision. The practical bound is typically around 248 test vectors or so. I am assuming that an IEEE-754 compatible platform is used, with default rounding mode round-to-nearest-or-even, and that all code is built with the strictest adherence to IEEE-754 that the compiler can muster (for my Intel compiler, this is /fp:strict
, for example).
The claim in the question is that the fast replacement achieves an error of 0.5 ulp or less in the vicinity of unity. In other words, the results are correctly rounded using the IEEE-754 rounding mode to-nearest-or-even. There are two ways of testing that assertion: Either use a correctly-rounded implementation of rsqrt()
as a reference and iterate the argument in 1 ulp steps until a mismatch is found, or use a multi-precision library as the reference, and stop when the ulp error of the fast alternative exceeds 0.5 ulps. In the latter case, we need reference results that are a little bit over twice as accurate as double precision to avoid double-rounding effects. For the reciprocal square root, a reference with 2n+3 bits suffices:
Cristina Iordache and David W. Matula: “On Infinitely Precise Rounding for Division, Square Root, Reciprocal and Square Root Reciprocal”. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic, Adelaide, Australia, April 14-16, 1999, pp. 233–240
The ISO-C99 code below uses the first approach. It start the search at unity and then searches either in the direction of zero or the direction of two, stopping at the first mismatch. The output is as follows:
arg = (1.0 + 2.2204460492503131e-016) quick_rsqrt = 0x1.0000000000000p+0 (1.0000000000000000e+000) rsqrt_rn = 0x1.fffffffffffffp-1 (9.9999999999999989e-001)
arg = (1.0 - 1.2166747276332046e-008) quick_rsqrt = 0x1.0000001a20bd7p+0 (1.0000000060833736e+000) rsqrt_rn = 0x1.0000001a20bd8p+0 (1.0000000060833738e+000)
I also tried the second approach, and got matching results. The error of the fast replacement at (1.0 + 2.2204460492503131e-16) is 0.9995 ulps, and the error at (1.0 - 1.2166747276332046e-8) is 0.5002 ulps.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
/* function under test */
double quick_rsqrt (double a)
{
return 2.0 / (1.0 + a);
}
/* starting approximation for reciprocal square root */
double simple_rsqrt (double a)
{
return 1.0 / sqrt (a);
}
/* most significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t hi_uint32_of_double (double a)
{
uint64_t t;
memcpy (&t, &a, sizeof t);
return (uint32_t)(t >> 32);
}
/* least significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t lo_uint32_of_double (double a)
{
uint64_t t;
memcpy (&t, &a, sizeof t);
return (uint32_t)t;
}
/* construct IEEE-754 binary64 from upper and lower half of its bit representation */
double mk_double_from_hilo_uint32 (uint32_t hi, uint32_t lo)
{
double r;
uint64_t t = ((uint64_t)hi << 32) + ((uint64_t)lo);
memcpy (&r, &t, sizeof r);
return r;
}
/* reciprocal square root, rounded to-nearest-or-even */
double rsqrt_rn (double a)
{
double y, h, l, e;
uint32_t alo, ahi, temp;
int32_t diff;
ahi = hi_uint32_of_double (a);
alo = lo_uint32_of_double (a);
if ((ahi - 0x00100000u) < 0x7fe00000u) { // positive normals
/* scale argument towards unity */
temp = (ahi & 0x3fffffff) | 0x3fe00000;
diff = temp - ahi; // exponent difference
a = mk_double_from_hilo_uint32 (temp, alo);
/* initial rsqrt approximation */
y = simple_rsqrt (a);
/* refine reciprocal square root approximation */
h = y * y;
l = fma (y, y, -h);
e = fma (l, -a, fma (h, -a, 1.0));
/* round according to Peter Markstein, "IA-64 and Elementary Functions" */
y = fma (fma (0.375, e, 0.5), e * y, y);
/* scale result near unity to correct range */
diff = diff >> 1; // adjust exponent; ensure arithmetic right shift which is not guaranteed by ISO-C99
a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + diff, lo_uint32_of_double (y));
} else if (a == 0.0) { // zeros
a = mk_double_from_hilo_uint32 ((ahi & 0x80000000) | 0x7ff00000, 0x00000000);
} else if (a < 0.0) { // negatives
a = mk_double_from_hilo_uint32 (0xfff80000, 0x00000000);
} else if (isinf (a)) { // infinities
a = mk_double_from_hilo_uint32 (ahi & 0x80000000, 0x00000000);
} else if (isnan (a)) { // NaNs
a = a + a;
} else { // positive subnormals
/* scale argument towards unity */
a = a * mk_double_from_hilo_uint32 (0x7fd00000, 0);
/* initial rsqrt approximation */
y = simple_rsqrt (a);
/* refine reciprocal square root approximation */
h = y * y;
l = fma (y, y, -h);
e = fma (l, -a, fma (h, -a, 1.0));
/* round according to Peter Markstein, "IA-64 and Elementary Functions" */
y = fma (fma (0.375, e, 0.5), e * y, y);
/* scale result near unity to correct range */
a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + 0x1ff00000, lo_uint32_of_double (y));
}
return a;
}
int main (void)
{
double x, ref, res;
/* Try arguments greater than unity */
x = 1.0;
do {
res = quick_rsqrt (x);
ref = rsqrt_rn (x);
if (res != ref) {
printf ("arg = (1.0 + %23.16e) quick_rsqrt = %21.13a (%23.16e) rsqrt_rn = %21.13a (%23.16e)\n",
x - 1.0, res, res, ref, ref);
break;
}
x = nextafter (x, 2.0);
} while (x < 2.0);
/* Try arguments less than unity */
x = 1.0;
do {
res = quick_rsqrt (x);
ref = rsqrt_rn (x);
if (res != ref) {
printf ("arg = (1.0 - %23.16e) quick_rsqrt = %21.13a (%23.16e) rsqrt_rn = %21.13a (%23.16e)\n",
1.0 - x, res, res, ref, ref);
break;
}
x = nextafter (x, 0.0);
} while (x > 0.0);
return EXIT_SUCCESS;
}