r linear-regression glm statistics-bootstrap

How to bootstrap dataset in R which is blocked by a factor?

I want to perform bootstrap on this data set. Notice that the data has two factors: replicate and level, and two variables high.density and low.density that need to be regressed. I want to perform a bootstrap on this data-set but the replacements can occur only within the nested factor of replicate and level.

replicate level high.density low.density
    1     low    14          36
    1     low    54          31
    1     mid    82          10
    1     mid    24          NA
    2     low    12          28
    2     low    11          45
    2     mid    12          17
    2     mid    NA          24
    2      up    40          10
    2      up    NA           5
    2      up    20           2

For instance, in replicate/ level: 1/low the low.density 31 and 36 can be interchanged (or high.density interchanged) so the head of that dataset may look like:

replicate level high.density low.density
    1     low    14          31
    1     low    54          36
    1     mid    82          10
    1     mid    24          NA

I then want to estimate the linear regression (glm) from this dataset. I would appreciate any feedback on trying to achieve this.

##DATA FRAME (credits: caldwellst)

    df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
    NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
    
    df$replicate <- as.factor(as.numeric(df$replicate))
    df$level <- as.factor(as.character(df$level)

)

Solution

We may exploit split and do the sampling according to unique combinations of replicate and level. We could repeat this process B times.

df_shuffle <- function(DF) {
  my_split <- split(DF, f = ~ DF$replicate + DF$level)
  shuffle <- lapply(my_split, function(x) {
    nrX <- nrow(x)
    cbind(x[, c('replicate', 'level')],
          high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
          low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
  })
  DF_new <- do.call(rbind, shuffle)
  rownames(DF_new) <- NULL
  return(DF_new)
}

B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
   replicate level high.density low.density
1          1   low           54          36
2          1   low           54          36
3          2   low           12          45
4          2   low           12          28
5          1   mid           24          10
6          1   mid           82          10
7          2   mid           NA          17
8          2   mid           12          17
9          2    up           20          10
10         2    up           40          10
11         2    up           20           5

Because the original data contains missing observations, we either have to multiply impute them or opt to lisewise delete them. For now, let's perform the latter option.

# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])

Finally, we perform a linear regression on each shuffled dataset and store the B coefficients in out.

row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
  lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)

## out <- row_bind(
##   lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
##                            family = binomial())$coef)
## )

# -------------------------------------------------------------------
> dim(out)
[1] 1000    2

Output

> head(out)
  X.Intercept. low.density
1     13.74881   0.2804738
2     20.01074  -0.2095672
3     30.26643  -0.2946373
4     29.19541  -0.2752761
5     37.76273  -0.4555651
6     37.72250  -0.1548349

The code required to create this image can be found here.