Search code examples
rloopsvectorization

Is there a way to vectorize these bootstrappign loops in R?


I'm new to R. I'm used to VB where loops are used heavily, but I know R is much more efficient if I can vectorize the data. I don't know if it's possible to vectorize what I've constructed here.

The general idea is, for n=3:N:

  1. Pull a random sample (without replacement) of size n from the original sample (of size N)
  2. Run a bootstrap parameter estimate on this subsample using B resamples
  3. Do it again X times
  4. Average all X parameter estimates and check convergence (i.e., look at the standard deviation between estimates or something like that - TBD).

I didn't implement step 4 yet, so the code below just does steps 1:3. Step 4 should be simple enough to do outside the loop using rowMeans().

Note: I set B and X to 100 for testing, but I need both to equal 10,000 (or more) for final use

# simulate observation of N=30
bdf <- data.frame(sample(8:13, 30, rep = TRUE)

# get number of observations
N <- length(bdf)

# set number of bootstrap replicates
B <- 100

# set number of times to repeat the estimate
X <- 100

# create empty storage container for results
result_vec <- vector(length=B)

# this loop iterates over the number of times to repeat the estimate
for (j in 1:X) {
  
  # this loop iterates sample size from n=2 to n=N
  for (i in 3:N) {
    
    # random sample of size n
    boot_samp <- bdf[sample(N, size=i, replace=FALSE)]
    
    # this loop does the bootstrap sampling
    for(b in 1:B) {
      # draw a bootstrap sample
      bsamp <- sample(boot_samp, size=i, replace=TRUE)
      
      # calculate your parameter
      p <- mean(bsamp)
      #p <- sd(bsamp)
      
      # save the calculated parameter
      result_vec[b] <- p
      
    }
    if (i==3) {
      # initiate data frame and store the results for n=2 parameter estimate
      df_res <- data.frame(result_vec)
    }
    else {
      # add the results for n=i parameter estimate to the data frame
      df_temp <- data.frame(result_vec)
      df_res <- cbind(df_res, df_temp)
    }
    
    # rename the column in the data frame as n=i
    names(df_res)[ncol(df_res)] <- paste("n = ",i)
  }
  
  # calculate the mean of the parameter estimates
  allmeans <- colMeans(df_res)
  
  if (j==1) {
    # initiate a new data frame to store the means
    df_means <- data.frame(allmeans)
  }
  else {
    # add the results to the existing data frame
    df_temp <- data.frame(allmeans)
    df_means <- cbind(df_means, df_temp)
  }
  
  # rename the column in the data frame with j
  names(df_means)[ncol(df_means)] <- j
}

Solution

  • The biggest speed culprit here is repeatedly calling cbind to grow your data.frame. See this post.

    The loops can all be replaced with a couple mapply calls. Since the innermost loop is done with replacement, the samples can be done all at once and placed in a matrix for rowMeans.

    # simulate observation of N=30
    bdf <- data.frame(sample(8:13, 30, rep = TRUE))
    
    # get number of observations
    N <- nrow(bdf)
    
    # set number of bootstrap replicates
    B <- 1e4
    
    # set number of times to repeat the estimate
    X <- 100
    
    # replace the loops to iterate over the number of times to repeat the estimate
    system.time({
      df_means <- mapply(
        \(j) colMeans(
          mapply(
            \(i) rowMeans(matrix(sample(sample(bdf[,1], i), B*i, 1), B, i)), 3:N
          )
        ), 1:X
      )
      dimnames(df_means) <- list(paste0("n", 3:N), paste0("j", 1:X))
    })
    #>    user  system elapsed 
    #>   21.58    1.39   22.98
    

    Additionally, this procedure is really easy to run in parallel:

    library(parallel)
    
    X <- 1e3
    
    system.time({
      cl <- makeCluster(detectCores() - 1) # 15 cores
      clusterExport(cl, c("bdf", "N", "B"))
      df_means <- simplify2array(
        parLapply(cl, 1:X, \(j) colMeans(
          mapply(
            \(i) rowMeans(matrix(sample(sample(bdf[,1], i), B*i, 1), B, i)), 3:N
          )
        ))
      )
      
      stopCluster(cl)
      dimnames(df_means) <- list(paste0("n", 3:N), paste0("j", 1:X))
    })
    #>    user  system elapsed 
    #>    0.02    0.14   22.08
    

    Doing B = X = 10000 in parallel would take less than 4 minutes on my aging laptop.