I need to scale big integers (several hundred bits) by a double. In particular, I need to compute
(M * factor) mod M
where M is a big integer and factor is a double. I won't be using any libraries unless you want to call a dozen lines of code in a header file a 'library'; hence big float math is not an option here.
Knuth and the GMP/MPIR source code had no answers, and here I found only Multiplication between big integers and doubles which is not really applicable, since second answer is too exotic and the first loses too much precision.
Working from first principles and simulating big integers with a uint64_t I came up with this (runnable with 64-bit VC++ or gcc/MinGW64):
#include <cassert>
#include <cfloat>
#include <climits>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <intrin.h> // VC++, MinGW
#define PX(format,expression) std::printf("\n%35s == " format, #expression, expression);
typedef std::uint64_t limb_t;
// precision will be the lower of LIMB_BITS and DBL_MANT_DIG
enum { LIMB_BITS = sizeof(limb_t) * CHAR_BIT };
// simulate (M * factor) mod M with a 'big integer' M consisting of a single limb
void test_mod_mul (limb_t modulus, double factor)
{
assert( factor >= 0 );
// extract the fractional part of the factor and discard the integer portion
double ignored_integer_part;
double fraction = std::modf(factor, &ignored_integer_part);
// extract the significand (aligned at the upper end of the limb) and the exponent
int exponent;
limb_t significand = limb_t(std::ldexp(std::frexp(fraction, &exponent), LIMB_BITS));
// multiply modulus and single-limb significand; the product will have (n + 1) limbs
limb_t hi;
/* limb_t lo = */_umul128(modulus, significand, &hi);
// The result comprises at most n upper limbs of the product; the lowest limb will be
// discarded in any case, and potentially more. Factors >= 1 could be handled as well,
// by dropping the modf() and handling exponents > 0 via left shift.
limb_t result = hi;
if (exponent)
{
assert( exponent < 0 );
result >>= -exponent;
}
PX("%014llX", result);
PX("%014llX", limb_t(double(modulus) * fraction));
}
int main ()
{
limb_t const M = 0x123456789ABCDEull; // <= 53 bits (for checking with doubles)
test_mod_mul(M, 1 - DBL_EPSILON);
test_mod_mul(M, 0.005615234375);
test_mod_mul(M, 9.005615234375);
test_mod_mul(M, std::ldexp(1, -16));
test_mod_mul(M, std::ldexp(1, -32));
test_mod_mul(M, std::ldexp(1, -52));
}
The multiplication and shift will be done with big integer math in my app, but the principle should be the same.
Is the basic approach correct or is the test working only because I'm testing with toy integers here? I don't know the first thing about floating point math, and I picked the functions from a C++ reference.
Clarification: everything from the multiplication onward will be done with (partial) big integer math; here I'm only using limb_t for this purpose in order to get a tiny toy program which can be posted and that actually runs. The final app will be using the moral equivalent of GMP's mpn_mul_1() and mpn_rshift().
A floating point number is nothing but a product of three terms. The three terms a sign, a significand (sometimes called mantissa), and an exponent. The value of these three terms is computed as
(-1)sign * significand * baseexponent
The base is normally 2 although the C++ standard doesn't guarantee that. Correspondingly, your computation becomes
(M * factor) mod M
== (M * (-1)sign * significand * baseexponent) mod M
== ((-1)sign(M) + sign * abs(M) * significand * baseexponent) mod M
Computing the sign of the result should be rather trivial. Computing X * baseexponent is rather straight forward to: it is either a suitable shift of bits if the base is 2 or a multiplication with/division by the power of base (left shift or multiplication for positive exponent, right shift or division for negative exponent). Assuming your big integer representation supports the modulus operation already, the only interesting term is the multiplication of abs(M) * significand but this is just a normal integer multiplication, although for your big integer representation. I haven't checked too closely but I think this is what the first answer you linked to does (the one you described as "too exotic").
The remaining bit is to extract sign, significand, and exponent from a double
. The sign can be determine easily by a comparison with 0.0
and the significand and the exponent can be obtained using frexp()
(see this answer for example). The significand is returned as a double
, i.e., you probably want to multiply it by 2std::numeric_limits<double>::digits
and adjust the exponent appropriately (I haven't done this in a while, i.e., I'm not entirely sure about the extact contract of frexp()
).
To answer your question: I'm not familiar with the GMP operations you are using but I think the operations you do indeed execute the computation described above.