Tags: r, pass-by-reference, pass-by-value

Sampling using common random numbers in R (efficiently!)


Is there any way to perform sampling using common random numbers with R?

There are many cases where you do the following many times (for instance, if you wanted to plot Monte Carlo estimates at many different parameter values). First, you sample, say, ten thousand variates from a normal distribution, and second, you take the average of some function of these samples, returning a single floating-point number. Now, if I wanted to change a few parameters, altering either of these two steps, I would have to redo both steps over and over again.

The naive way would be to sample fresh draws over and over again using some function like rnorm(). A less naive way would be to use a different function that takes a large collection of common random numbers as an argument. However, with this approach there might still be a lot of copying going on, since R mostly uses pass-by-value semantics. What are some tools that would let me get around this and avoid all this copying in the second situation?
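
For concreteness, here is a minimal sketch of the naive version (mc_estimate and theta are placeholders for illustration, not my actual functions):

    # naive approach: fresh draws from rnorm() for every parameter value
    mc_estimate <- function(theta, n = 10000) {
      x <- rnorm(n)              # re-sampled on every call
      mean(exp(theta * x))       # average of some function of the samples
    }
    sapply(seq(0, 1, by = 0.1), mc_estimate)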


Solution

  • I think you're asking two types of questions here:

    1. Programmatically, can we preserve a large pull of random data in a way that side-steps R's default pass-by-value?

    2. Mathematically, if we make a large pull of random data and pick from it piecemeal, can we arbitrarily change the parameters used in the pull?

    The answer to 1 is "yes": pass-by-reference semantics are possible in R, but they take a little more work. All of the implementations I've seen and played with are done with environments or non-R-native objects (C/C++ pointers to structs or such). Here is one example that caches a large pull of random "normal" data and checks the pool of available data on each call:

    my_rnorm_builder <- function(deflen = 10000) {
      .cache <- numeric(0)   # pool of pre-pulled standard-normal draws
      .index <- 0L           # number of draws consumed so far
      .deflen <- deflen
      check <- function(n) {
        # refill the pool when fewer than 'n' unused draws remain
        if ((.index + n) > length(.cache)) {
          message("reloading") # this should not be here "in-production"
          l <- length(.cache)
          .cache <<- c(.cache[ .index + seq_len(l - .index) ],
                       rnorm(.deflen + n + l))
          .index <<- 0L
        }
      }
      function(n, mean = 0, sd = 1) {
        check(n)
        if (n > 0) {
          # scale and shift standard-normal draws to N(mean, sd^2)
          out <- mean + sd * .cache[ .index + seq_len(n) ]
          .index <<- .index + as.integer(n)
          return(out)
        } else return(numeric(0))
      }
    }
    

    It is by no means resilient to hostile users or other likely mistakes. It does not guarantee the length of available remaining random numbers. (Putting in checks like that would slow it down past a threshold of reasonableness, with the benchmark below in mind.)
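
    For illustration only (and not part of the benchmarked code below), the kind of guard alluded to above could be bolted on as a thin wrapper, assuming a my_rnorm generator created by the builder as in the demo below:

    # hypothetical wrapper: basic argument checking, at the cost of extra per-call overhead
    safe_rnorm <- function(n, mean = 0, sd = 1) {
      stopifnot(length(n) == 1L, !is.na(n), n >= 0, is.finite(mean), sd >= 0)
      my_rnorm(n, mean = mean, sd = sd)
    }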

    Demo of it in operation:

    my_rnorm <- my_rnorm_builder(1e6)
    # starts empty
    get(".index", env=environment(my_rnorm))
    # [1] 0
    length(get(".cache", env=environment(my_rnorm)))
    # [1] 0
    
    set.seed(2)
    my_rnorm(3) # should see "reloading"
    # reloading
    # [1] -0.8969145  0.1848492  1.5878453
    my_rnorm(3) # should not see "reloading"
    # [1] -1.13037567 -0.08025176  0.13242028
    # prove that we've changed things internally
    get(".index", env=environment(my_rnorm))
    # [1] 6
    length(get(".cache", env=environment(my_rnorm)))
    # [1] 1000003
    
    head(my_rnorm(1e6)) # should see "reloading"
    # reloading
    # [1]  0.7079547 -0.2396980  1.9844739 -0.1387870  0.4176508  0.9817528
    

    Let's make sure that the random-number scaling of mu + sigma*x makes sense by starting over and resetting our seed:

    # reload the definition of my_rnorm
    my_rnorm <- my_rnorm_builder(1e6)
    length(get(".cache", env=environment(my_rnorm)))
    # [1] 0
    set.seed(2)
    my_rnorm(3) # should see "reloading"
    # reloading
    # [1] -0.8969145  0.1848492  1.5878453
    my_rnorm(3, mean = 100) # should not see "reloading"
    # [1]  98.86962  99.91975 100.13242
    

    So to answer question 2: "yes". Quick inspection reveals that those last three numbers are indeed "100 plus" the numbers in the second my_rnorm(3) in the previous block. So shifting and scaling pre-pulled "normal" random numbers by mu and sigma holds. And we did this while still using the large pre-pulled cache of random data.
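
    If you prefer a programmatic check over eyeballing the digits, a small sketch along these lines (not part of the original demo) confirms the shift:

    my_rnorm <- my_rnorm_builder(1e6)
    set.seed(2)
    invisible(my_rnorm(3))            # first block of three
    b <- my_rnorm(3)                  # second block of three
    my_rnorm <- my_rnorm_builder(1e6) # start over with the same seed
    set.seed(2)
    invisible(my_rnorm(3))
    all.equal(my_rnorm(3, mean = 100), b + 100)
    # should be TRUE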


    But is it worth it? This is a naïve test/comparison in and of itself; constructive suggestions are welcome.

    t(sapply(c(1,5,10,100,1000,10000), function(n) {
      s <- summary(microbenchmark::microbenchmark(
        base = rnorm(n),
        my   = my_rnorm(n),
        times = 10000, unit = "ns"
      ))
      c(n = n, setNames(s$median, s$expr))  
    }))
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    # reloading
    #          n   base    my
    # [1,]     1   1100  1100
    # [2,]     5   1400  1300
    # [3,]    10   1600  1400
    # [4,]   100   6400  2000
    # [5,]  1000  53100  6600
    # [6,] 10000 517000 49900
    

    (All medians are in nanoseconds.) So while it would have seemed intuitive that "smaller pulls done more frequently" (compared with rnorm) would benefit from this caching, I cannot explain why the caching is not very helpful until pulls of 100 and greater.

    Can this be extended to other distributions? Almost certainly. "Uniform" would be straightforward (similarly scale and shift; a rough sketch follows), but some others might take a little more calculus to do correctly. (For instance, it is not obvious without more research how the "t" distribution could alter the degrees of freedom on pre-pulled data ... if that's even possible. Though I do count myself a statistician in some ways, I am not prepared to claim yes/no/maybe on that one yet. :-)
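
    For what it's worth, a hedged sketch of the uniform case might look like the following, caching standard-uniform draws and rescaling them to [min, max] on demand (my_runif_builder is a made-up name, and it has not been benchmarked like the "normal" version above):

    my_runif_builder <- function(deflen = 10000) {
      .cache <- numeric(0)   # pool of pre-pulled U(0,1) draws
      .index <- 0L
      .deflen <- deflen
      check <- function(n) {
        if ((.index + n) > length(.cache)) {
          l <- length(.cache)
          .cache <<- c(.cache[ .index + seq_len(l - .index) ],
                       runif(.deflen + n + l))
          .index <<- 0L
        }
      }
      function(n, min = 0, max = 1) {
        check(n)
        if (n > 0) {
          # rescale U(0,1) draws to U(min, max)
          out <- min + (max - min) * .cache[ .index + seq_len(n) ]
          .index <<- .index + as.integer(n)
          return(out)
        } else return(numeric(0))
      }
    }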