r if-statement simulation simplification

Simplify Simulations on R

as I mentioned in a previous question. I am brand new to programming and have no prior experience, but am very happy to be learning. However, I've run into the following problem, my professor has given us the following:

sim1 <- function(n) {
  xm <- matrix(nrow=n,ncol=2)
  for (i in 1:n) {
    d <- rnorm(1)
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d + 69
    } else {
      xm[i,1] <- 0
      xm[i,2] <- 2*d + 64
    }
  }
  return(xm)
}

With the following task: Try to improve the efficiency of this code. Use speed.test to see if it is improved for generating n=1000 observations.

I have finally at least been able to figure out what this code does, nonetheless, I am completely lost on how I could possibly make this code more efficient.

Any help means a whole lot. Thank you!

Solution

If possible, don't use loops in R. rep and rnorm will fill vectors with 5, 10, or 500,000 values all in one call, very quickly. Calling rnorm(1) 500,000 times is a waste and much slower than simply calling rnorm(500000). It's like taking a Ferrari for a drive, going 1 foot and stopping, going 1 foot and stopping, over and over to get to your destination.

This function will return statistically identical results as your function. However, instead of using loops, it does things in the R way.

sim2 <- function(n) {
    n1 <- floor(n/2)  #this is how many of the else clause we'll do
    n2 <- n - n1  #this is how many of the if clause we'll do
    col11 <- rep(0, n1) #bam! we have a vector filled with 0s
    col12 <- (rnorm(n1) * 2) + 64 #bam! vector filled with deviates
    col21 <- rep(1, n2) #bam! vector filled with 1s
    col22 <- (rnorm(n2) * 2.5) + 69 #bam! vector filled with deviates
    xm <- cbind(c(col11,col21), c(col12,col22)) #now we have a matrix, 2 cols, n rows
    return(xm[sample(nrow(xm)),]) #shuffle the rows, return matrix
}

No loops! The functionality might be obvious but in case it is not, I'll explain. First, n1 & n2 are simply to split the size of n appropriately (accounting for odd numbers).

Next, the binomial process (i.e., if(runif(1) < 0.5) {} else {}) per element can be eliminated since we know that in sim1, half of the matrix falls into the if condition and half in the else (see proof below). We don't need to decide for each element over and over and over which random path to take when we know that it's 50/50. So, we're going to do ALL the else 50% first: we fill a vector with n/2 0s (col11) and another with n/2 random deviates (mean = 0, sd = 1 by default) and, for each deviate, multiply by 2 and add 64, with result vector col12. That 50% is done.

Next, we finish the second 50% (the if portion). We fill a vector with n/2 1s (col21) and another with random deviates and, for each deviate, multiply by 2.5 and add 69.

We now have 4 vectors that we'll turn into a matrix. STEP 1: We glue col11 (filled with n/2 0s) and col21 (filled with n/2 1s) together using the c function to get a vector (n elements). STEP 2: Glue col12 and col22 together (filled with the deviates) using c to get a vector (like a 1 column x n row matrix). Note: 0s/1s are associated with the correct deviates based on 64/69 formulas. STEP 3: Use cbind to make a matrix (xm) out of the vectors: 0/1 vector becomes column 1, deviate vector becomes column 2. STEP 4: Get the number of rows in the matrix (which should just be n) using nrow. STEP 5: Make a shuffled vector with all the row numbers randomly ordered using sample. STEP 6: Make a new (unnamed) matrix putting xm's rows in order according to the shuffled vector. The point of steps 4-6 is just to randomly order the rows, since the binomial process in sim1 would have produced a random order of rows.

This version runs 866% faster!

> system.time({ sim1(500000)})
   user  system elapsed 
  1.341   0.179   1.527 
> system.time({ sim2(500000)})
   user  system elapsed 
  0.145   0.011   0.158

If you're concerned about proof that this maintains the integrity of the binomial process, consider that the binomial process does two things: 1) It associates 1 with the 2.5*d+69 equation and 0 with the 2*d + 64 equation - the association is maintained since rows are shuffled intact; 2) 50% go in the if clause and 50% in the else clause, as proved below.

sim3 <- function(n) {
    a <- 0
    for(j in 1:n) {
        if(runif(1) < 0.5) {
            a <- a + 1
        }
    }
    return(a/n)
}
> sim3(50)
[1] 0.46
> sim3(5000)
[1] 0.4926
> sim3(10000)
[1] 0.5022
> sim3(5000000)
[1] 0.4997844

The binomial process produces 50% 1s and 50% 0s (column 1).