modulo factorial modular-arithmetic montgomery-multiplication

Can Montgomery multiplication be used to speed up the computation of (large number)! % (some prime)

This question originates in a comment I almost wrote below this question, where Zack is computing the factorial of a large number modulo a large number (that we will assume to be prime for the sake of this question). Zack is using the traditional computation of factorial, taking the remainder at each multiplication.

I almost commented that an alternative to consider was Montgomery multiplication, but thinking more about it, I have only seen this technique used to speed up several multiplications by the same multiplicand (in particular, to speed up the computation of aⁿ mod p).

My question is: can Montgomery multiplication be used to speed up the computation of n! mod p for large n and p?

Solution

Naively, no; you need to transform each of the n terms of the product into the "Montgomery space", so you have n full reductions mod m, the same as the "usual" algorithm.

However, a factorial isn't just an arbitrary product of n terms; it's much more structured. In particular, if you already have the "Montgomerized" kr mod m, then you can use a very cheap reduction to get (k+1)r mod m.

So this is perfectly feasible, though I haven't seen it done before. I went ahead and wrote a quick-and-dirty implementation (very untested, I wouldn't trust it very far at all):

// returns m^-1 mod 2**64 via clever 2-adic arithmetic (http://arxiv.org/pdf/1209.6626.pdf)
uint64_t inverse(uint64_t m) {
    assert(m % 2 == 1);
    uint64_t minv = 2 - m;
    uint64_t m_1 = m - 1;
    for (int i=1; i<6; i+=1) { m_1 *= m_1; minv *= (1 + m_1); }
    return minv;
}

uint64_t montgomery_reduce(__uint128_t x, uint64_t minv, uint64_t m) {
    return x + (__uint128_t)((uint64_t)x*-minv)*m >> 64;
}

uint64_t montgomery_multiply(uint64_t x, uint64_t y, uint64_t minv, uint64_t m) {
    return montgomery_reduce(full_product(x, y), minv, m);
}

uint64_t montgomery_factorial(uint64_t x, uint64_t m) {
    assert(x < m && m % 2 == 1);
    uint64_t minv = inverse(m); // m^-1 mod 2**64
    uint64_t r_mod_m = -m % m;  // 2**64 mod m
    uint64_t mont_term = r_mod_m;
    uint64_t mont_result = r_mod_m;
    for (uint64_t k=2; k<=x; k++) {
        // Compute the montgomerized product term: kr mod m = (k-1)r + r mod m.
        mont_term += r_mod_m;
        if (mont_term >= m) mont_term -= m;
        // Update the result by multiplying in the new term.
        mont_result = montgomery_multiply(mont_result, mont_term, minv, m);
    }
    // Final reduction
    return montgomery_reduce(mont_result, minv, m);
}

and benchmarked it against the usual implementation:

__uint128_t full_product(uint64_t x, uint64_t y) {
    return (__uint128_t)x*y;
}

uint64_t naive_factorial(uint64_t x, uint64_t m) {
    assert(x < m);
    uint64_t result = x ? x : 1;
    while (x --> 2) result = full_product(result,x) % m;
    return result;
}

and against the usual implementation with some inline asm to fix a minor inefficiency:

uint64_t x86_asm_factorial(uint64_t x, uint64_t m) {
    assert(x < m);
    uint64_t result = x ? x : 1;
    while (x --> 2) {
        __asm__("mov %[result], %%rax; mul %[x]; div %[m]"
                : [result] "+d" (result) : [x] "r" (x), [m] "r" (m) : "%rax", "flags");
    }
    return result;
}

Results were as follows on my Haswell laptop for reasonably large x:

implementation   speedup
---------------------------
naive            1.00x
x86_asm          1.76x
montgomery       5.68x

So this really does seem to be a pretty nice win. The codegen for the Montgomery implementation is pretty decent, but could probably be improved somewhat further with hand-written assembly as well.

This is an interesting approach for "modest" x and m. Once x gets large, the various approaches that have sub-linear complexity in x will necessarily win out; factorial has so much structure that this method doesn't take advantage of.