I want to perform bootstrap on this data set. Notice that the data has two factors: replicate
and level
, and two variables high.density
and low.density
that need to be regressed. I want to perform a bootstrap on this data-set but the replacements can occur only within the nested factor of replicate and level.
replicate level high.density low.density
1 low 14 36
1 low 54 31
1 mid 82 10
1 mid 24 NA
2 low 12 28
2 low 11 45
2 mid 12 17
2 mid NA 24
2 up 40 10
2 up NA 5
2 up 20 2
For instance, in replicate/ level
: 1/low
the low.density
31 and 36 can be interchanged (or high.density
interchanged) so the head of that dataset may look like:
replicate level high.density low.density
1 low 14 31
1 low 54 36
1 mid 82 10
1 mid 24 NA
I then want to estimate the linear regression (glm) from this dataset. I would appreciate any feedback on trying to achieve this.
##DATA FRAME (credits: caldwellst)
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
df$replicate <- as.factor(as.numeric(df$replicate))
df$level <- as.factor(as.character(df$level)
)
We may exploit split
and do the sampling according to unique combinations of replicate
and level
. We could repeat this process B
times.
df_shuffle <- function(DF) {
my_split <- split(DF, f = ~ DF$replicate + DF$level)
shuffle <- lapply(my_split, function(x) {
nrX <- nrow(x)
cbind(x[, c('replicate', 'level')],
high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
})
DF_new <- do.call(rbind, shuffle)
rownames(DF_new) <- NULL
return(DF_new)
}
B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
replicate level high.density low.density
1 1 low 54 36
2 1 low 54 36
3 2 low 12 45
4 2 low 12 28
5 1 mid 24 10
6 1 mid 82 10
7 2 mid NA 17
8 2 mid 12 17
9 2 up 20 10
10 2 up 40 10
11 2 up 20 5
Because the original data contains missing observations, we either have to multiply impute them or opt to lisewise delete them. For now, let's perform the latter option.
# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])
Finally, we perform a linear regression on each shuffled dataset and store the B
coefficients in out
.
row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)
## out <- row_bind(
## lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
## family = binomial())$coef)
## )
# -------------------------------------------------------------------
> dim(out)
[1] 1000 2
Output
> head(out)
X.Intercept. low.density
1 13.74881 0.2804738
2 20.01074 -0.2095672
3 30.26643 -0.2946373
4 29.19541 -0.2752761
5 37.76273 -0.4555651
6 37.72250 -0.1548349
The code required to create this image can be found here.