Search code examples
rlinear-regressionglmstatistics-bootstrap

How to bootstrap dataset in R which is blocked by a factor?


I want to perform bootstrap on this data set. Notice that the data has two factors: replicate and level, and two variables high.density and low.density that need to be regressed. I want to perform a bootstrap on this data-set but the replacements can occur only within the nested factor of replicate and level.

replicate level high.density low.density
    1     low    14          36
    1     low    54          31
    1     mid    82          10
    1     mid    24          NA
    2     low    12          28
    2     low    11          45
    2     mid    12          17
    2     mid    NA          24
    2      up    40          10
    2      up    NA           5
    2      up    20           2

For instance, in replicate/ level: 1/low the low.density 31 and 36 can be interchanged (or high.density interchanged) so the head of that dataset may look like:

replicate level high.density low.density
    1     low    14          31
    1     low    54          36
    1     mid    82          10
    1     mid    24          NA

I then want to estimate the linear regression (glm) from this dataset. I would appreciate any feedback on trying to achieve this.

##DATA FRAME (credits: caldwellst)

    df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
    NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
    
    df$replicate <- as.factor(as.numeric(df$replicate))
    df$level <- as.factor(as.character(df$level)

)

Solution

  • We may exploit split and do the sampling according to unique combinations of replicate and level. We could repeat this process B times.

    df_shuffle <- function(DF) {
      my_split <- split(DF, f = ~ DF$replicate + DF$level)
      shuffle <- lapply(my_split, function(x) {
        nrX <- nrow(x)
        cbind(x[, c('replicate', 'level')],
              high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
              low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
      })
      DF_new <- do.call(rbind, shuffle)
      rownames(DF_new) <- NULL
      return(DF_new)
    }
    
    B <- 1000L
    df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
    # ---------------------------------------------------
    > df_list[[B]]
       replicate level high.density low.density
    1          1   low           54          36
    2          1   low           54          36
    3          2   low           12          45
    4          2   low           12          28
    5          1   mid           24          10
    6          1   mid           82          10
    7          2   mid           NA          17
    8          2   mid           12          17
    9          2    up           20          10
    10         2    up           40          10
    11         2    up           20           5
    

    Because the original data contains missing observations, we either have to multiply impute them or opt to lisewise delete them. For now, let's perform the latter option.

    # listwise delete missing observations
    df_list <- lapply(df_list, function(x) x[complete.cases(x), ])
    

    Finally, we perform a linear regression on each shuffled dataset and store the B coefficients in out.

    row_bind <- function(x) data.frame(do.call(rbind, x))
    out <- row_bind(
      lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
    )
    
    ## out <- row_bind(
    ##   lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
    ##                            family = binomial())$coef)
    ## )
    
    # -------------------------------------------------------------------
    > dim(out)
    [1] 1000    2
    

    Output

    > head(out)
      X.Intercept. low.density
    1     13.74881   0.2804738
    2     20.01074  -0.2095672
    3     30.26643  -0.2946373
    4     29.19541  -0.2752761
    5     37.76273  -0.4555651
    6     37.72250  -0.1548349
    

    enter image description here

    The code required to create this image can be found here.