Calculating Catalan Numbers mod prime number

The following is the problem description:

let c[n] be the catalan number for n and p be a large prime eg.1000000007

I need to calculate c[n] % p where n ranges from {1,2,3,...,1000}

The problem which I am having is that on a 32 bit machine you get overflow when you calculate catalan number for such large integer. I am familiar with modulo arithmetic. Also

(a.b) % p = ((a % p)(b % p)) % p

this formula helps me to get away with the overflow in numerator separately but I have no idea how to deal with denominators.

Solution

For a modulus of 1000000007, avoiding overflow with only 32-bit integers is cumbersome. But any decent C implementation provides 64-bit integers (and any decent C++ implementation does too), so that shouldn't be necessary.

Then to deal with the denominators, one possibility is, as KerrekSB said in his comment, to calculate the modular inverse of the denominators modulo the prime p = 1000000007. You can calculate the modular inverse with the extended Euclidean algorithm or, equivalently, the continued fraction expansion of k/p. Then instead of dividing by k in the calculation, you multiply by its modular inverse.

Another option is to use Segner's recurrence relation for the Catalan numbers, which gives a calculation without divisions:

C(0) = 1
         n
C(n+1) = ∑ C(i)*C(n-i)
         0

Since you only need the Catalan numbers C(k) for k <= 1000, you can precalculate them, or quickly calculate them at program startup and store them in a lookup table.

If contrary to expectation no 64-bit integer type is available, you can calculate the modular product by splitting the factors into low and high 16 bits,

a = a1 + (a2 << 16)    // 0 <= a1, a2 < (1 << 16)
b = b1 + (b2 << 16)    // 0 <= b1, b2 < (1 << 16)
a*b = a1*b1 + (a1*b2 << 16) + (a2*b1 << 16) + (a2*b2 << 32)

To calculate a*b (mod m) with m <= (1 << 31), reduce each of the four products modulo m,

p1 = (a1*b1) % m;
p2 = (a1*b2) % m;
p3 = (a2*b1) % m;
p4 = (a2*b2) % m;

and the simplest way to incorporate the shifts is

for(i = 0; i < 16; ++i) {
    p2 *= 2;
    if (p2 >= m) p2 -= m;
}

the same for p3 and with 32 iterations for p4. Then

s = p1+p2;
if (s >= m) s -= m;
s += p3;
if (s >= m) s -= m;
s += p4;
if (s >= m) s -= m;
return s;

That way is not very fast, but for the few multiplications needed here, it's fast enough. A small speedup should be obtained by reducing the number of shifts; first calculate (p4 << 16) % m,

for(i = 0; i < 16; ++i) {
    p4 *= 2;
    if (p4 >= m) p4 -= m;
}

then all of p2, p3 and the current value of p4 need to be multiplied with 2¹⁶ modulo m,

p4 += p3;
if (p4 >= m) p4 -= m;
p4 += p2;
if (p4 >= m) p4 -= m;
for(i = 0; i < 16; ++i) {
    p4 *= 2;
    if (p4 >= m) p4 -= m;
}
s = p4+p1;
if (s >= m) s -= m;
return s;