Search code examples
c++algorithmdivision

Fast Division C++


I'm working on a scientific computing project that requires arbitrary precision for any real number. The way I am representing the real number approximations as two dynamic arrays of uint64_t. Each uint64_t is a chunk of 64 digits, and the two arrays are the digits in the integer part and fractional part. When more digits in the fractional part are calculated, the chunk is pushed to the end of the dynamic array.

I have been trying to find a fast algorithm for long division. Obviously I can use the simple Euclidean division algorithm with the C++ functions / and % which utilize hardware.

My first attempt was tackling the problem p/q when p,q are uint64_t and where p<q; so this rational number is less than 1 and p,q < 2^64. My intuition was to use a base 2^64 "number system" where each digit is a uint64_t and treat each step in the Euclidean algorithm using /, % as atomic steps. The algorithm will return the digits of the first 64 decimal values (there are no significant digits on the left of the decimal point in this case) and the remainder value (which will also be a uint64_t) because the reminder is less than the denominator which is < 2^64.

However, after hours of algebra, I learned the hard way that if you are doing long division in a base b number system, you have to be able to divide a number < b^2. For example 13/5. (This is not a p<q example, but this is a base 10 number system, so we must be able to divide numbers <= 99; more specifically we only need to be able to divide numbers <= 89)

      2
  _____    // 5 does not divide 1 or 3, so it must be able to atomically divide 13
5 | 1 3

I think that the best way to do this is brute long division. But that seems like it will create too many branches in control flow. I know there must be a smart way to implement this that minimizes a lot of branches.

One immediate optimization is to utilize a fast Log2 function like the ones described in the first answer to the linked question. Using the log2 function, we can determine the place of the leading bit in a uint64_t. This way when can skip past chunks of zeros while performing the Euclidean algorithm, because the denominator will not divide a number whose leading bit has a place before the leading bit of the denominator. This gives the chance for the algorithm to perform nicely. But I can easily think of a worst case in which we perform a division for every bit.

This problem cannot be solved by changing the size of the "chunk" to uint32_t. It seems like we can avoid the 13/5 problem by concatenating two uint32_t in the numerator into a uint64_t and extending the uint32_t denominator to uint64_t. Then we can perform the hardware / and % in one shot. But, there are cases where the remainder is still very large (e.g. the leading bit is in the 2^60 place). We cannot perform this division in two shots. We have to scan across three digits and perform another division (not to mention the shifts and masks needed to fit the new working values in to the registers).

My question is simply: is this the best we have? It seems like this is inherently O(n) in / and % calls. I don't see any way that you can use divide and conquer or use some sort of smart arithmetic with carries. But I was very pleasantly surprised when learning about the fast log2 algorithm.


Solution

  • Let's look at long division using a number system with base 1000.

    For example, let's say you divide 12345 by 19683, or [012][345] by [019][683] to use base-1000 "digits".

    You immediately discover that p < q, and so you append a base-1000 digit [000]. Now you are dividing [012][345][000] by [019][683]. Your intermediate goal is to find a quotient and remainder of this integer division.

    The simplest algorithm is to try all digits from [000] to [999] and multiply q by them, then choose one which gives a remainder that is smaller than q (you get [627]). In base 2, this step is very easy, which makes division in higher-base number systems a bit confusing.

    An obvious improvement, the one you described in your question, is to determine the first "sub-digit": try multiplying first by [000]...[900]; you get the first sub-digit of your quotient: [6xx]. Then multiply q by "digits" [600]...[690]; you get the second sub-digit: [62x]. Then multiply q by "digits" [620]...[629]; you get the last sub-digit: [627]. Here, n = 3, and this algorithm uses O(n) multiplications.

    To improve this algorithm, you should use an existing hardware implementation, which can divide 128-bit numbers by 64-bit numbers, and find 64-bit quotient and remainder. In my base-1000 example, this corresponds to dividing numbers less than 1000000 by divisors less than 1000. The only thing which prevents you from doing this directly is that your divisor has more than one digit. However, it's pretty easy to overcome: calculate just an estimate of a quotient, using the most significant digit.

    In my example, use the most significant "digit" [019] to estimate "[012][345][000] divided by [019][683]". Naively, you can take two most significant digits from the dividend: 12345 / 19 = 649. You can then search around this value to find the correct value [627]. However, this method suffers from precision loss, because the first "digit" was so small. To fix this problem, normalize both dividend and divisor so the most significant "sub-digit" of the divisor is not 0:

    [012][345][000] / [019][683] = [123][450][000] / [196][830]
    [123][450][000] / [196][830] ≈      [123][450] /      [196]
         [123][450] /      [196] =           [629]
    

    This estimate is closer to the correct value. In fact, because of the range of the most significant "digit", you can prove an upper bound of such an estimation, which in my case is something like 10 (the smaller base of our number systems). If using binary, the upper bound will be about 2 (not sure about the proof; please do it by yourself). So you will need to search in the range ±2 around the estimated "digit" of the quotient to find the correct "digit". This is O(1) multiplications by divisor.

    So, in summary, you will need O(1) multiplications by divisor to calculate each "digit" of your quotient in base 264 if you first normalize your divisor such that its most significant bit is 1.

    To do 128-bit division, you can use uint128_t (if your platform has it), inline assembly or dedicated intrinsics like _udiv128.