Search code examples
stringalgorithmrandomprimesrabin-karp

Random primes and Rabin Karp substring search


I am reading the Rabin-Karb algorithm from Sedgewick. The book says:

We use a random prime Q taking as large a value as possible while avoiding overflow

At first reading I didn't notice the significance of random and when I saw that in the code a long is used my first thoughts were:
a) Use Eratosthene's sieve to find a big prime that fits a long
or
b) look up from a list of primes any prime large enough that is greater than int and use it as a constant.

But then the rest of the explanation says:

We will use a long value greater than 10^20 making the probability that a collision happens less than 10^-20

This part got me confused since a long can not fit 10^20 let alone a value greater than that. Then when I checked the calculation for the prime the book defers to an exercise that has just the following hint:

A random n-digit number is prime with probability proportional to 1/n

What does that mean?

So basically what I don't get is:
a) what is the meaning of using a random prime? Why can't we just pre-calculate it and use it as a constant?
b) why is the 10^20 mentioned since it is out of range for long?
c) How is that hint helpful? What does it mean exactly?


Solution

  • Once again, Sedgewick has tried to simplify an algorithm and gotten the details slightly wrong. First, as you observe, 1020 cannot be represented in 64 bits. Even taking a prime close to 263 − 1, however, you probably would want a bit of room to multiply the normal way without overflowing so that the subsequent modulo is correct. The answer uses a 31-bit prime, which makes this easy but only offers collision probabilities in the 10−9 range.

    The original version uses Rabin fingerprints and a random irreducible polynomial over 𝔽2[x], which from the perspective of algebraic number theory behaves a lot like a random prime over the integers. If we choose the polynomial to be degree 32 or 64, then the fingerprints fit perfectly into a computer word of the appropriate length, and polynomial addition and subtraction both work out to bitwise XOR, so there is no overflow.

    Now, Sedgewick presumably didn't want to explain how polynomial rings work. Fine. If I had to implement this approach in practice, I'd choose a prime p close to the max that was easy to mod by with cheap instructions (I'm partial to 231 − 227 + 1; EDIT actually 231 − 1 works even better since we don't need a smooth prime here) and then choose a random number in [1, p−1] to evaluate the polynomials at (this is how Wikipedia explains it). The reason that we need some randomness is that otherwise the oblivious adversary could choose an input that would be guaranteed to have a lot of hash collisions, which would severely degrade the running time.

    Sedgewick wanted to follow the original a little more closely than that, however, which in essence evaluates the polynomials at a fixed value of x (literally x in the original version that uses polynomial rings). He needs a random prime so that the oblivious adversary can't engineer collisions. Sieving numbers big enough is quite inefficient, so he turns to the Prime Number Theorem (which is the math behind his hint, but it holds only asymptotically, which makes a big mess theoretically) and a fast primality test (which can be probabilistic; the cases where it fails won't influence the correctness of the algorithm, and they are rare enough that they won't affect the expected running time).

    I'm not sure how he proves a formal bound on the collision probability. My rough idea is basically, show that there are enough primes in the window of interest, use the Chinese Remainder Theorem to show that it's impossible for there to be a collision for too many primes at once, conclude that the collision probability is bounded by the probability of picking a bad prime, which is low. But the Prime Number Theorem holds only asymptotically, so we have to rely on computer experiments regarding the density of primes in machine word ranges. Not great.