Search code examples
modulofactorialmodular-arithmeticmontgomery-multiplication

Can Montgomery multiplication be used to speed up the computation of (large number)! % (some prime)


This question originates in a comment I almost wrote below this question, where Zack is computing the factorial of a large number modulo a large number (that we will assume to be prime for the sake of this question). Zack is using the traditional computation of factorial, taking the remainder at each multiplication.

I almost commented that an alternative to consider was Montgomery multiplication, but thinking more about it, I have only seen this technique used to speed up several multiplications by the same multiplicand (in particular, to speed up the computation of an mod p).

My question is: can Montgomery multiplication be used to speed up the computation of n! mod p for large n and p?


Solution

  • Naively, no; you need to transform each of the n terms of the product into the "Montgomery space", so you have n full reductions mod m, the same as the "usual" algorithm.

    However, a factorial isn't just an arbitrary product of n terms; it's much more structured. In particular, if you already have the "Montgomerized" kr mod m, then you can use a very cheap reduction to get (k+1)r mod m.

    So this is perfectly feasible, though I haven't seen it done before. I went ahead and wrote a quick-and-dirty implementation (very untested, I wouldn't trust it very far at all):

    // returns m^-1 mod 2**64 via clever 2-adic arithmetic (http://arxiv.org/pdf/1209.6626.pdf)
    uint64_t inverse(uint64_t m) {
        assert(m % 2 == 1);
        uint64_t minv = 2 - m;
        uint64_t m_1 = m - 1;
        for (int i=1; i<6; i+=1) { m_1 *= m_1; minv *= (1 + m_1); }
        return minv;
    }
    
    uint64_t montgomery_reduce(__uint128_t x, uint64_t minv, uint64_t m) {
        return x + (__uint128_t)((uint64_t)x*-minv)*m >> 64;
    }
    
    uint64_t montgomery_multiply(uint64_t x, uint64_t y, uint64_t minv, uint64_t m) {
        return montgomery_reduce(full_product(x, y), minv, m);
    }
    
    uint64_t montgomery_factorial(uint64_t x, uint64_t m) {
        assert(x < m && m % 2 == 1);
        uint64_t minv = inverse(m); // m^-1 mod 2**64
        uint64_t r_mod_m = -m % m;  // 2**64 mod m
        uint64_t mont_term = r_mod_m;
        uint64_t mont_result = r_mod_m;
        for (uint64_t k=2; k<=x; k++) {
            // Compute the montgomerized product term: kr mod m = (k-1)r + r mod m.
            mont_term += r_mod_m;
            if (mont_term >= m) mont_term -= m;
            // Update the result by multiplying in the new term.
            mont_result = montgomery_multiply(mont_result, mont_term, minv, m);
        }
        // Final reduction
        return montgomery_reduce(mont_result, minv, m);
    }
    

    and benchmarked it against the usual implementation:

    __uint128_t full_product(uint64_t x, uint64_t y) {
        return (__uint128_t)x*y;
    }
    
    uint64_t naive_factorial(uint64_t x, uint64_t m) {
        assert(x < m);
        uint64_t result = x ? x : 1;
        while (x --> 2) result = full_product(result,x) % m;
        return result;
    }
    

    and against the usual implementation with some inline asm to fix a minor inefficiency:

    uint64_t x86_asm_factorial(uint64_t x, uint64_t m) {
        assert(x < m);
        uint64_t result = x ? x : 1;
        while (x --> 2) {
            __asm__("mov %[result], %%rax; mul %[x]; div %[m]"
                    : [result] "+d" (result) : [x] "r" (x), [m] "r" (m) : "%rax", "flags");
        }
        return result;
    }
    

    Results were as follows on my Haswell laptop for reasonably large x:

    implementation   speedup
    ---------------------------
    naive            1.00x
    x86_asm          1.76x
    montgomery       5.68x
    

    So this really does seem to be a pretty nice win. The codegen for the Montgomery implementation is pretty decent, but could probably be improved somewhat further with hand-written assembly as well.

    This is an interesting approach for "modest" x and m. Once x gets large, the various approaches that have sub-linear complexity in x will necessarily win out; factorial has so much structure that this method doesn't take advantage of.