Search code examples
rrandom

Repeated random numbers in R, how to stop?


I am generating a small collection of random uniform numbers in R (10,000 in this case), and I am finding repeats at an alarming rate (say once every few repetitions on the 10,000 number draw). Why is this happening, and how can I stop it (efficiently)?

Here is a min working example where I just go ahead and draw 1 million unif(0,1) values. This seems to generate many repeated values. I did have some problems with finding the repeats, so had to do some ugly stuff below, and there is a chance this might fail to find repeats sometimes due to rounding or something, but it seems to work on my machine.

rs_val <- .Random.seed 
tvals <- runif(10^6,0,1) 
tr1 <- as.numeric(names(which(table(tvals)>1)))[1] 
tr_idx <- which(abs(tvals-tr1)<0.000000001) 
tr_idx[1]
tr_idx[2]
format(tvals[tr_idx[1]],digits=22) 
format(tvals[tr_idx[2]],digits=22) 
tvals[tvals==as.numeric(format(tvals[tr_idx[1]],digits=22))]

Presumably, 1 million draws shouldn't be that much at all for a modern pseudo-random number generator, yes? I've seen a few posts here about similar issues with C++, but none on R. In the latter it had something to do with the random draws occurring faster than the random seed resetting or something like that.

Can somebody provide an efficient modification to the above code to prevent the repeated random number issue? Basically, I'm just simulating a Poisson process on a long timeline, and this is causing simultaneous events, which is problematic. I suppose, I could just go through the simulation and discard the repeated draws, but it would be nice to not have to do that.


Solution

  • From ?Random:

    Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)

    This is easy to verify:

    identical(min(diff(unique(sort(runif(1e6))))), 2^-32)
    #> [1] TRUE
    sum((runif(1e6) %% 2^-32))
    #> [1] 0
    

    So another option besides using Wichmann-Hill is to sample 1:(2^32 - 1) without replacement and divide by 2^32, although, this is not as performant as runif with Wichmann-Hill.

    system.time(x <- sample(2^32 - 1, 1e6)/2^32)
    #>    user  system elapsed 
    #>    0.16    0.01    0.17
    RNGkind(kind="Wichmann-Hill")
    system.time(x <- runif(1e6))
    #>    user  system elapsed 
    #>    0.04    0.00    0.05
    

    A second option is to call runif twice: once for 32-bit precision, and then a second call to fill in the gaps to make it double precision:

    RNGkind(kind="default")
    
    runif64 <- function(n) runif(n) + runif(n, -2^-33, 2^-33)
    x <- runif64(1e6)
    range(x)
    #> [1] 4.710828e-07 9.999975e-01
    anyDuplicated(x)
    #> [1] 0
    

    runif64 is about twice as slow as just calling runif.

    microbenchmark::microbenchmark(
      runif64(1e6),
      runif(1e6)
    )
    #> Unit: milliseconds
    #>            expr     min       lq     mean median       uq      max neval
    #>  runif64(1e+06) 61.1622 63.33880 65.91086 64.926 67.18855 109.3689   100
    #>    runif(1e+06) 29.7481 30.34125 31.91007 30.997 33.71515  40.3735   100
    

    The probability of seeing any duplicate values in 1e6 samples of 2^64 possible values is very small:

    library(Rmpfr)
    
    n <- mpfr(2^64, 128)
    m <- mpfr(1e6, 128)
    as.numeric(1 - exp(lgamma(n + 1) - m*log(n) - lgamma(n - m + 1)))
    #> [1] 2.710503e-08