I'm new to R. I'm used to VB, where loops are used heavily, but I know R is much more efficient if I can vectorize the computation. I don't know whether it's possible to vectorize what I've constructed here.
The general idea is, for n = 3:N:

1. Draw a random subsample of size n from the original sample (of size N), without replacement.
2. Draw B bootstrap resamples from that subsample and compute the parameter estimate on each.
3. Repeat steps 1-2 X times.
4. Average the X parameter estimates and check convergence (i.e., look at the standard deviation between estimates or something like that - TBD).

I didn't implement step 4 yet, so the code below just does steps 1:3. Step 4 should be simple enough to do outside the loop using rowMeans().
Note: I set B and X to 100 for testing, but I need both to equal 10,000 (or more) for final use.
# simulate observation of N=30
bdf <- data.frame(sample(8:13, 30, rep = TRUE))
# get number of observations
N <- nrow(bdf)
# set number of bootstrap replicates
B <- 100
# set number of times to repeat the estimate
X <- 100
# create empty storage container for results
result_vec <- numeric(B)
# this loop iterates over the number of times to repeat the estimate
for (j in 1:X) {
  # this loop iterates the subsample size from n=3 to n=N
  for (i in 3:N) {
    # random subsample of size n, drawn without replacement
    boot_samp <- bdf[sample(N, size = i, replace = FALSE), 1]
    # this loop does the bootstrap sampling
    for (b in 1:B) {
      # draw a bootstrap resample, with replacement
      bsamp <- sample(boot_samp, size = i, replace = TRUE)
      # calculate your parameter
      p <- mean(bsamp)
      # p <- sd(bsamp)
      # save the calculated parameter
      result_vec[b] <- p
    }
    if (i == 3) {
      # initiate a data frame and store the results for the n=3 parameter estimate
      df_res <- data.frame(result_vec)
    } else {
      # add the results for the n=i parameter estimate to the data frame
      df_temp <- data.frame(result_vec)
      df_res <- cbind(df_res, df_temp)
    }
    # rename the newest column in the data frame as n=i
    names(df_res)[ncol(df_res)] <- paste("n =", i)
  }
  # calculate the mean of the parameter estimates
  allmeans <- colMeans(df_res)
  if (j == 1) {
    # initiate a new data frame to store the means
    df_means <- data.frame(allmeans)
  } else {
    # add the results to the existing data frame
    df_temp <- data.frame(allmeans)
    df_means <- cbind(df_means, df_temp)
  }
  # rename the column in the data frame with j
  names(df_means)[ncol(df_means)] <- j
}
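For reference, step 4 outside the loops could look something like this (a sketch using the df_means built above; the exact convergence criterion is still TBD):

# step 4: average the X repeated estimates for each n ...
final_est <- rowMeans(df_means)
# ... and look at the spread between repeats as a convergence check
est_sd <- apply(df_means, 1, sd)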
The biggest speed culprit here is repeatedly calling cbind() to grow your data.frame. See this post.
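A minimal sketch of the difference (hypothetical sizes, zeros standing in for real results):

# growing: each cbind() re-copies every column accumulated so far,
# so total copying grows quadratically with the number of columns
grow <- function(k) {
  out <- data.frame(numeric(1e3))
  for (i in 2:k) out <- cbind(out, data.frame(numeric(1e3)))
  out
}
# pre-allocating: reserve the full matrix once, then fill it column by column
prealloc <- function(k) {
  out <- matrix(0, nrow = 1e3, ncol = k)
  for (i in 1:k) out[, i] <- numeric(1e3)
  out
}
system.time(grow(1e3))     # noticeably slower ...
system.time(prealloc(1e3)) # ... than allocating once and filling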
The loops can all be replaced with a couple of mapply() calls. Since the innermost loop samples with replacement, those draws can all be done at once and placed in a matrix for rowMeans().
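To make the key step explicit, the inner b-loop collapses to a single call (a sketch using the boot_samp, B, and i from the question's code):

# draw all B*i bootstrap values in one call, reshape to B rows
# (one resample per row); rowMeans() then gives all B estimates at once
boot_ests <- rowMeans(matrix(sample(boot_samp, B * i, replace = TRUE), nrow = B, ncol = i))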
# simulate observation of N=30
bdf <- data.frame(sample(8:13, 30, rep = TRUE))
# get number of observations
N <- nrow(bdf)
# set number of bootstrap replicates
B <- 1e4
# set number of times to repeat the estimate
X <- 100
# replace all of the loops: the outer mapply() iterates over the number of times to repeat the estimate
system.time({
  df_means <- mapply(
    \(j) colMeans(
      mapply(
        # for each subsample size i: draw i values without replacement,
        # then draw all B*i bootstrap values at once, reshape to a
        # B x i matrix, and take row means to get the B bootstrap
        # estimates; colMeans() then averages them for each n
        \(i) rowMeans(matrix(sample(sample(bdf[, 1], i), B * i, replace = TRUE), B, i)),
        3:N
      )
    ),
    1:X
  )
  dimnames(df_means) <- list(paste0("n", 3:N), paste0("j", 1:X))
})
#> user system elapsed
#> 21.58 1.39 22.98
Additionally, this procedure is really easy to run in parallel:
library(parallel)
X <- 1e3
system.time({
  cl <- makeCluster(detectCores() - 1) # 15 cores
  # each worker needs its own copy of the data and constants
  clusterExport(cl, c("bdf", "N", "B"))
  df_means <- simplify2array(
    parLapply(cl, 1:X, \(j) colMeans(
      mapply(
        \(i) rowMeans(matrix(sample(sample(bdf[, 1], i), B * i, replace = TRUE), B, i)),
        3:N
      )
    ))
  )
  stopCluster(cl)
  dimnames(df_means) <- list(paste0("n", 3:N), paste0("j", 1:X))
})
#> user system elapsed
#> 0.02 0.14 22.08
Doing B = X = 10000 in parallel would take less than 4 minutes on my aging laptop.
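One caveat: sampling on multiple workers needs properly seeded RNG streams if you want reproducible results. The parallel package ships with L'Ecuyer streams for exactly this; call it after makeCluster() and before parLapply() (the seed value here is just an example):

# give each worker its own reproducible RNG stream
clusterSetRNGStream(cl, iseed = 42)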