Controlling rounding with floating values

I am using large integers (128 bit or more) with boost. I want to avoid some very slow divisions by using floating point. I can get away with just the upper bits of the quotients in many cases and I want to use this by using floating point. Say I have:

ccp_int a, b; // big integers +ve
double fa, fb; // a float that would be big enough to hold the exponent.
fa/fb >= a/b  so a floor of a divide
(fa - 1) / fb + 1 <= (a - 1) / b + 1  a ceiling of a divide

So presumably this means being able to assign fa = a say with correct rounding and also to do the divides rounding in different directions. Is this the kind of thing you can control with floating point? My experience of fp is pretty much zero.

Solution

Like Eric mentioned, fesetround() is in principle the way to control the float rounding mode, but unfortunately it does not have great support.

Here is a thought on a way around this. Extract the upper 32 bits manually, then use POD arithmetic to compute the division. So essentially, emulate the floating point conversion and division in order to have full control over how rounding is done. To simplify this outline, I'll assume a and b are both positive at least 2^32, and hopefully it will be clear how to fill these details for a general implementation:

Use a_bits = msb(a) to get the index of the most significant bit in a that is set to 1 (for more details, search for "msb" in this page on boost number).
Get the upper 32 bits with uint32_t a_high = a >> (a_bits - 31).

This step is effectively a rounding operation, rounding down. To round up instead, we can increment a_high by one if lsb(a) < a_bits - 31, that is, if the lowest set bit occurs below the extracted 32-bit portion. (An edge case is that this increment may cause the uint32_t to overflow, which I'll ignore.) I'll use "a_high_down" to refer to rounding down and "a_high_up" for rounding up.

Either way, we have a float approximation of the dividend a:
```
a ≈ 2^(a_bits − 31) ⋅ a_high.
```
In the same way, make an approximate representation of the divisor b:
```
b ≈ 2^(b_bits − 31) ⋅ b_high.
```

Then, the division a/b is approximately

a/b ≈ 2^(a_bits − b_bits − 32) ⋅ (2^32 ⋅ a_high) // b_high,

where // denotes integer division rounding down. The numerator 2^32 ⋅ a_high should be formed by casting a_high to uint64_t and shifting up.

A couple notes:

This adjustment is important so that the quotient keeps some bits of precision (since a_high and b_high have similar magnitude, doing simply a_high // b_high as an integer division would give only 1 bit of precision).
Instead of this 32-bit shifting, we could have extracted a_high as the upper 64 bits of a for a little more precision.

To bound the quotient from above and below, compute

a/b ≤ 2^(a_bits − b_bits − 32) ⋅ ((2^32 ⋅ a_high_up − 1) // b_high_down + 1),
a/b ≥ 2^(a_bits − b_bits − 32) ⋅ (2^32 ⋅ a_high_down) // b_high_up,

where // denotes integer division rounding down. The divisions on the rhs can be done in native uint64_t arithmetic. Finally, cast these division results to cpp_int and left-shifted by (a_bits − b_bits − 32).

Examples

Ex 1. For the values

a = 2^100 ⋅ 123456 = 156499072501776288991176990922899456,
b = 2^70 ⋅ 123455 = 145749938535668012464209920,

the exact quotient is about a/b = 1073750521.434887. The procedure above accurately approximates this, with the first 10 digits correct:

a_bits = floor(log2(a)) = 116
a_high = a >> (116 − 31) = 4045406208

b_bits = floor(log2(b)) = 86
b_high = b >> (86 − 31) = 4045373440

a/b ≈ 2^(a_bits − b_bits − 32) ⋅ (2^32 ⋅ a_high) // b_high
    = 2^(−2) ⋅ (2^32 ⋅ 4045406208) // 4045373440
    = 0.25 ⋅ 4295002085
    = 1073750521.25.

Ex 2. For the values

a = 10^50 = 100000000000000000000000000000000000000000000000000,
b = 9^50 = 515377520732011331036461129765621272702107522001,

the exact quotient is about a/b = 194.03252175, and

a_bits = floor(log2(a)) = 166
a_high_down = a >> (166 − 31) = 2295887403
a_high_up = a_high_down + 1 = 2295887404

b_bits = floor(log2(b)) = 158
b_high_down = b >> (158 − 31) = 3029116820
b_high_up = b_high_down + 1 = 3029116821

a/b ≤ 2^(a_bits − b_bits − 32) ⋅ ((2^32 ⋅ a_high_up − 1) // b_high_down + 1)
    = 2^(−24) ⋅ (9860761315478339583 // 3029116820 + 1)
    = 2^(−24) ⋅ 3255325530
    = 194.03252184

a/b ≥ 2^(a_bits − b_bits − 32) ⋅ (2^32 ⋅ a_high_down) // b_high_up
    = 2^(−24) ⋅ (9860761311183372288 // 3029116821)
    = 2^(−24) ⋅ 3255325526
    = 194.03252161