Search code examples
rloopsregressionresamplingstatistics-bootstrap

Get multiple regression coefficient values calculated by resampling


I am calculating the coefficients of multiple linear models using resampling. Before I was using the boot function, but in the future need to include new statistics in the analysis so I think this way is better. A replicable example:

iris <- iris[,1:4]

nboots <- 100
ncol = ncol(iris)
boot.r.squared <- numeric(nboots)      
boot.p_model <- numeric(nboots)
boot.coef_p <- numeric(nboots)
boot.coef_estimate <- matrix(nrow= nboots,ncol=ncol)
boot.coef_error <- matrix(nrow= nboots,ncol=ncol)
boot.coef_t <- matrix(nrow= nboots,ncol=ncol)
boot.coef_p <- matrix(nrow= nboots,ncol=ncol)

for(i in 1:nboots){
  boot.samp <- iris[sample(nrow(iris),size = 100, replace = TRUE,), ] 
  model <- lm(boot.samp$Sepal.Length ~ .,boot.samp)
  model.sum <- summary(model)

  boot.r.squared[i] <- model.sum$r.squared
  stat <- model.sum$fstatistic
  boot.p_model[i] <- pf(stat[1], stat[2], stat[3], lower.tail = FALSE)

  boot.coef_estimate[i, 1:length(model$coefficients)] <- model$coefficients[1]
  boot.coef_error[i, 1:length(model$coefficients)] <- model$coefficients[2]
  boot.coef_t[i, 1:length(model$coefficients)] <- model$coefficients[3]
  boot.coef_p[i, 1:length(model$coefficients)] <- model$coefficients[4] 
}

But I can't correctly save the coefficients in the form of a matrix. I would like to save 4 matrices containing in each: parameter, the error, statistic t and p value.

With this code all columns are the same. I tried put [,1] to save the first column, but this error happens. How can i fix this error?

Error in model $ coefficients [, 1]: incorrect number of dimensions


Solution

  • Your code works for me without errors. You probably experimented with different numbers of repetitions and forgot to update the sizes of the empty objects you define before the loop, since I could reprocude your error when I changed to nboots <- 1000.

    You could do that without a for loop and use replicate() instead (which is basically a wrapper for lapply()). The advantage is that it might be a little more straightforward and you won't need to define the empty objects beforehand.

    The procedure is, first, to define a bootstrap function, say bootFun(), second, the actual bootstrap using replicate() which yields an array, and finally third, to process the data in the array into the desired matrices.

    # Step 1 bootstrap function
    bootFun <- function(X) {
      fit <- lm(Sepal.Length ~ ., data=X)
      s <- summary(fit)
      r2 <- s$r.squared
      f <- s$fstatistic
      p.f <- pf(f[1], f[2], f[3], lower.tail = FALSE)
      ms <- s$coefficients
      return(cbind(ms, r2, p.f))
    }
    
    # Step 2 bootstrap
    nboots <- 1e2
    set.seed(42)  # for sake of reproducibility
    A <- replicate(nboots, bootFun(iris[sample(1:nrow(iris), size=nrow(iris), replace=TRUE), ]))
    # Note: I used here `size=nrow(iris)`, but you also could use size=100 if this is your method
    
    # Step 3 process array ("de-transpose" and transform to list)
    Ap <- aperm(A, c(3, 1, 2))  # aperm() is similar to t() for matrices
    Aps <- asplit(Ap, 3)  # split the array into a handy list
    
    # or in one step
    Aps <- asplit(aperm(A, c(3, 1, 2)), 3)
    

    Result

    Now you have a list containing all the matrices, look at the names of the list objects.

    names(Aps)
    # [1] "Estimate"   "Std. Error" "t value"    "Pr(>|t|)"   "r2"         "p.f"
    

    So, if we e.g. want the matrix of our estimates, we simply do:

    estimates <- Aps$Estimate
    
    estimates[1:3, ]
    #      (Intercept) Sepal.Width Petal.Length Petal.Width
    # [1,]    1.353531   0.7655760    0.8322749  -0.7775090
    # [2,]    1.777431   0.6653308    0.7353491  -0.6024095
    # [3,]    2.029428   0.5825554    0.6941457  -0.4795787
    

    Note, that the F-statistics and the R2 are also matrices and duplicated in each row for every coefficient. Since you only need one, just extract e.g. the first column:

    Aps$p.f[1:3, ]
    #       (Intercept)  Sepal.Width Petal.Length  Petal.Width
    # [1,] 2.759019e-65 2.759019e-65 2.759019e-65 2.759019e-65
    # [2,] 5.451912e-66 5.451912e-66 5.451912e-66 5.451912e-66
    # [3,] 3.288712e-54 3.288712e-54 3.288712e-54 3.288712e-54
    
    Aps$p.f[1:3, 1]
    # [1] 2.759019e-65 5.451912e-66 3.288712e-54
    

    Benchmark

    The speed of both approaches is practically the same with a small advantage for replicate(). Here a benchmark comparing both methods where I used nboot=1,000 replications.

    # Unit: seconds
    #      expr      min       lq     mean   median       uq      max neval cld
    #   forloop 2.215259 2.235797 2.332234 2.381035 2.401933 2.700622   100   a
    # replicate 2.218291 2.240570 2.313526 2.257905 2.400532 2.601958   100   a