In numerical computation, it is often needed to scale numbers to be in safe range.
For example, computing Euclidean distance: sqrt(a^2+b^2)
. Here, if magnitude of a
or b
is too small/large, then underflow/overflow can happen.
A common approach to solve this is to divide numbers by the largest magnitude number. However, this solution is:
So I thought that instead of dividing by the largest magnitude number, let's multiply it with a close power-of-2 reciprocal number. This seems a better solution, as:
So, I'd like to create a small utility function, which has a logic like this (by ^
, I mean exponentiation):
void getScaler(double value, double &scaler, double &scalerReciprocal) {
int e = <exponent of value>;
if (e<-1022) { scaler=2^-1022; scalerReciprocal = 2^1022; }
} else if (e>1022) { scaler=2^1022; scalerReciprocal = 2^-1022; }
} else { scaler=2^e; scalerReciprocal = 2^(2046-e); }
}
This function should return a normalized scaler
& scalerReciprocal
, both are power-of-2 numbers, where scaler
is near to value
, and scalerReciprocal
is the reciprocal of scaler
.
The maximum allowed exponents for scaler
/scaleReciprocal
are -1022..1022
(I don't want to work with subnormal scaler
, as subnormal numbers can be slow).
What would be a fast way to do this? Can this be done with pure floating-point operations? Or should I extract the exponent from value
, and use simple if
s to do the logic? Is there some kind of trick to do the comparison with (-)1022 fast (as the range is symmetric)?
Note: scaler
doesn't need to be the closest possible power-of-2. If some logic needs it, scaler
can be some small power-of-2 away from the closest value.
Based on wim's answer, here's another solution, which can be faster, as it has one less instruction. The output is a little bit different, but still fulfills the requirements.
The idea is to use bit operations to fix border cases: put a 01
to the lsb of the exponent, no matter of its value. So, exponent:
00
to 01
), or halved (when 10->01) or 1/4 (when 11->01)So, this modified routine works (and I think that it's pretty cool that the problem can be solved with only 2 fast asm instructions):
#include<stdio.h>
#include<stdint.h>
#include<immintrin.h>
/* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */
union dbl_int64{
double d;
uint64_t i;
};
double get_scale(double t){
union dbl_int64 x;
uint64_t and_i;
uint64_t or_i;
/* 0xFEDCBA9876543210 */
and_i = 0x7FD0000000000000ull;
or_i = 0x0010000000000000ull;
x.d = t;
x.i = (x.i & and_i)|or_i; /* Set fraction bits to zero, take absolute value */
return x.d;
}
double get_scale_x86(double t){
__m128d x = _mm_set_sd(t);
__m128d x_and = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
__m128d x_or = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
x = _mm_and_pd(x, x_and);
x = _mm_or_pd(x, x_or);
return _mm_cvtsd_f64(x);
}
/* Compute the inverse 1/t of a double t with all zero fraction bits */
/* and exponent between the limits of function get_scale */
/* A single integer subtraction is much less expensive than a */
/* floating point division. */
double inv_of_scale(double t){
union dbl_int64 x;
/* 0xFEDCBA9876543210 */
uint64_t inv_mask = 0x7FE0000000000000ull;
x.d = t;
x.i = inv_mask - x.i;
return x.d;
}
double inv_of_scale_x86(double t){
__m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
__m128d x = _mm_set_sd(t);
__m128i x_i = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
}
int main(){
int n = 14;
int i;
/* Several example values, 4.94e-324 is the smallest subnormal */
double y[14] = { 4.94e-324, 1.1e-320, 1.1e-300, 1.1e-5, 0.7, 1.7, 123.1, 1.1e300,
1.79e308, -1.1e-320, -0.7, -1.7, -123.1, -1.1e307};
double z, s, u;
printf("Portable code:\n");
printf(" x pow_of_2 inverse pow2*inv x*inverse \n");
for (i = 0; i < n; i++){
z = y[i];
s = get_scale(z);
u = inv_of_scale(s);
printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
}
printf("\nx86 specific SSE code:\n");
printf(" x pow_of_2 inverse pow2*inv x*inverse \n");
for (i = 0; i < n; i++){
z = y[i];
s = get_scale_x86(z);
u = inv_of_scale_x86(s);
printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
}
return 0;
}