Tags: python, numpy, floating-point, probability, ieee-754

What are the odds of a repeat in numpy.random.rand(n) (assuming perfect randomness)?


For the moment, put aside any issues relating to pseudorandom number generators and assume that numpy.random.rand perfectly samples from the discrete distribution of floating point numbers over [0, 1). What are the odds of getting at least two exactly identical floating point numbers in the result of:

numpy.random.rand(n)

for any given value of n?

Mathematically, I think this is equivalent to first asking how many IEEE 754 singles or doubles there are in the interval [0, 1). Then I guess the next step would be to solve the equivalent birthday problem? I'm not really sure. Anyone have some insight?
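For what it's worth, a brute-force empirical check is easy to write (just a rough sketch; repeats only become at all likely once n is enormous, somewhere around 10^8):

    import numpy as np

    n = 10_000_000  # illustrative draw count
    x = np.random.rand(n)
    # A repeat occurred iff the number of distinct values is less than n.
    repeats = n - np.unique(x).size
    print(repeats, "repeated draws")

But I'd like the exact probability, not a simulation.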


Solution

  • The computation performed by numpy.random.rand for each element generates a number 0.<53 random bits>, for a total of 2^53 equally likely outputs. (Of course, the memory representation isn't a fixed-point 0.stuff; it's still floating point.) This computation is incapable of producing most binary64 floating-point numbers between 0 and 1; for example, it cannot produce 1/2^60. You can see the code in numpy/random/mtrand/randomkit.c:

    double
    rk_double(rk_state *state)
    {
        /* shifts : 67108864 = 0x4000000, 9007199254740992 = 0x20000000000000 */
        long a = rk_random(state) >> 5, b = rk_random(state) >> 6;
        return (a * 67108864.0 + b) / 9007199254740992.0;
    }
    

    (Note that rk_random produces 32-bit outputs, regardless of the size of long.)
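    As a rough Python transcription of the same arithmetic (using np.random.default_rng() merely as a stand-in 32-bit bit source in place of rk_random):

        import numpy as np

        rng = np.random.default_rng()

        def rand_double_53():
            # Same computation as rk_double: take the top 27 bits and top 26 bits
            # of two 32-bit draws, combine them into a 53-bit integer, and scale
            # by 2**-53 (67108864 = 2**26, 9007199254740992 = 2**53).
            a = int(rng.integers(0, 2**32)) >> 5   # top 27 bits
            b = int(rng.integers(0, 2**32)) >> 6   # top 26 bits
            return (a * 67108864.0 + b) / 9007199254740992.0

    Every output is an integer multiple of 2^-53, which is why there are exactly 2^53 equally likely results.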

    Assuming a perfect source of randomness, the probability of repeats in numpy.random.rand(n) is 1-(1-0/k)(1-1/k)(1-2/k)...(1-(n-1)/k), where k=2^53. It's probably best to use an approximation instead of calculating this directly for large values of n. (The approximation may even be more accurate, depending on how the approximation error compares to the rounding error accumulated in a direct computation.)
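    As a sketch, both the direct product and the standard birthday-problem approximation 1 - exp(-n(n-1)/(2k)) are easy to compute (the function names below are just for illustration):

        import math

        K = 2**53  # number of equally likely outputs of rk_double

        def p_repeat_exact(n, k=K):
            # 1 - (1 - 0/k)(1 - 1/k)...(1 - (n-1)/k), accumulated in floating point
            p_distinct = 1.0
            for i in range(n):
                p_distinct *= 1.0 - i / k
            return 1.0 - p_distinct

        def p_repeat_approx(n, k=K):
            # 1 - exp(-n(n-1)/(2k)); expm1 preserves precision when the result is tiny
            return -math.expm1(-n * (n - 1) / (2.0 * k))

        print(p_repeat_exact(10**6))    # roughly 5.5e-05 for a million draws
        print(p_repeat_approx(10**6))

    The expm1-based version avoids the catastrophic cancellation you would get from computing 1 - exp(...) directly when the probability is very small.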