
Save Gradient Boosting Machine values obtained with Bootstrap


I am fitting a gradient boosting machine (GBM) to identify the importance of the variables in the model, and I am using bootstrap resampling to see how the importance of each variable varies across resamples.

But I can't correctly save each variable's name together with its importance calculated in each bootstrap replicate.

I'm doing this with a statistic function that is passed to the boot() function from the boot package.

Below is a minimal reproducible example adapted to the AmesHousing data:

library(gbm)
library(boot)
library(AmesHousing)

df <- make_ames()

imp_gbm <- function(data, indices) {
  d <- data[indices,]
  gbm.fit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )

  return(summary(gbm.fit)[, 2])
}

results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)

results_GBM$t0

I expect the bootstrap results to be saved with their variable names, but I can only save the importance values without the names.


Solution

  • With summary.gbm, the default is to sort the variables by importance and to draw a plot. Set order = FALSE and plotit = FALSE; the returned importance values are then in the same order as the variables in the fit, so they can be named with gbmfit$var.names.

    imp_gbm <- function(data, indices) {
      d <- data[indices,]
      # use gbmfit because gbm.fit is a function
      gbmfit <- gbm(
        formula = Sale_Price ~ .,
        distribution = "gaussian",
        data = d,
        n.trees = 100,
        interaction.depth = 5,
        shrinkage = 0.1,
        cv.folds = 5,
        n.cores = NULL,
        verbose = FALSE
      )
      # keep the fit's variable order so names and values line up
      o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
      names(o) <- gbmfit$var.names
      return(o)
    }
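
    A possible follow-up, sketched under some assumptions: boot() keeps the names on t0 but stores the replicates in t as an unnamed matrix with one column per element of the statistic, so the names can be copied from t0 to the columns of t. The names imp_mat and imp_summary, the seed, and R = 20 are illustrative choices, and cv.folds is left out here only to shorten the run (assuming it is not needed when the importance is taken at a fixed n.trees).

    ```r
    library(gbm)
    library(boot)
    library(AmesHousing)

    df <- make_ames()

    # Statistic returning a named vector of relative influence,
    # in the fixed order of gbmfit$var.names.
    imp_gbm <- function(data, indices) {
      d <- data[indices, ]
      gbmfit <- gbm(
        formula = Sale_Price ~ .,
        distribution = "gaussian",
        data = d,
        n.trees = 100,
        interaction.depth = 5,
        shrinkage = 0.1,
        verbose = FALSE
      )
      o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
      names(o) <- gbmfit$var.names
      o
    }

    set.seed(1)  # illustrative seed, only for reproducibility
    results_GBM <- boot(data = df, statistic = imp_gbm, R = 20)  # small R to keep it quick

    # t0 carries the names; t is an unnamed R x p matrix, so copy them over.
    imp_mat <- results_GBM$t
    colnames(imp_mat) <- names(results_GBM$t0)

    # Per-variable mean and spread of the importance across replicates.
    imp_summary <- data.frame(
      variable     = colnames(imp_mat),
      mean_rel_inf = colMeans(imp_mat),
      sd_rel_inf   = apply(imp_mat, 2, sd),
      row.names    = NULL
    )
    head(imp_summary[order(-imp_summary$mean_rel_inf), ])
    ```

    Because order = FALSE fixes the variable order in every replicate, each column of imp_mat refers to the same variable throughout, which is what makes the column-wise summaries meaningful.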