R caret: Combine rfe() and train()


I want to combine recursive feature elimination via rfe() with hyperparameter tuning and model selection via trainControl(), using the method rf (random forest). Instead of the standard summary statistics, I would like to report the MAPE (mean absolute percentage error). I therefore tried the following code on the ChickWeight data set:

library(caret)
library(randomForest)
library(MLmetrics)

# Compute MAPE instead of other metrics
mape <- function(data, lev = NULL, model = NULL){
  mape <- MAPE(y_pred = data$pred, y_true = data$obs)
  c(MAPE = mape)
}

# specify trainControl
trc <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid", savePredictions = TRUE,
                    summaryFunction = mape)
# set up grid
tunegrid <- expand.grid(.mtry=c(1:3))

# specify rfeControl
rfec <- rfeControl(functions=rfFuncs, method="cv", number=10, saveDetails = TRUE)

set.seed(42)
results <- rfe(weight ~ Time + Chick + Diet, 
           sizes=c(1:3), # candidate subset sizes (number of predictors) to evaluate
           data = ChickWeight, 
           method="rf",
           ntree = 250, 
           metric= "RMSE", 
           tuneGrid=tunegrid,
           rfeControl=rfec,
           trControl = trc)

The code runs without errors. But where do I find the MAPE that I defined as the summaryFunction in trainControl? Is trainControl executed or ignored?

How could I rewrite the code to do recursive feature elimination with rfe, tune the hyperparameter mtry using trainControl within rfe, and at the same time compute an additional error measure (MAPE)?


Solution

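  • trainControl is ignored here, as its description

    Control the computational nuances of the train function

    would suggest: it only affects train(), and rfe() with rfFuncs fits randomForest() directly rather than going through train(). To report MAPE from rfe(), replace the summary function in the rfeControl object: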

    rfec$functions$summary <- mape
    

    Then call rfe() with metric = "MAPE" and maximize = FALSE, since a lower MAPE is better:

    rfe(weight ~ Time + Chick + Diet, 
        sizes = c(1:3),
        data = ChickWeight, 
        method ="rf",
        ntree = 250, 
        metric = "MAPE", # Modified
        maximize = FALSE, # Modified
        rfeControl = rfec)
    #
    # Recursive feature selection
    #
    # Outer resampling method: Cross-Validated (10 fold) 
    #
    # Resampling performance over subset size:
    #
    #  Variables   MAPE  MAPESD Selected
    #          1 0.1903 0.03190         
    #          2 0.1029 0.01727        *
    #          3 0.1326 0.02136         
    #         53 0.1303 0.02041         
    #
    # The top 2 variables (out of 2):
    #    Time, Chick.L
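
The call above reports MAPE but does not tune mtry. One possible way to get both, sketched here under a few assumptions, is to switch the rfeControl functions from rfFuncs to caretFuncs, whose fit step calls train(), so that tuneGrid and trControl are actually used. In this sketch mape is a base-R stand-in for the question's MLmetrics-based function, the inner tuning uses train()'s default RMSE rather than MAPE, and the fold counts and ntree are reduced only to keep the nested resampling quick; none of these choices come from the original answer.

```r
library(caret)          # rfe(), rfeControl(), caretFuncs, train()
library(randomForest)   # backend for method = "rf"

# Base-R stand-in for the question's mape(); same data$obs / data$pred contract
mape <- function(data, lev = NULL, model = NULL) {
  c(MAPE = mean(abs((data$obs - data$pred) / data$obs)))
}

# Outer loop: caretFuncs fits every predictor subset with train()
rfec <- rfeControl(functions = caretFuncs, method = "cv", number = 5)
rfec$functions$summary <- mape   # report MAPE for the outer resampling

# Inner loop: resampling that train() uses to pick mtry (default metric: RMSE)
trc <- trainControl(method = "cv", number = 3)
tunegrid <- expand.grid(mtry = 1:3)

set.seed(42)
results <- rfe(weight ~ Time + Chick + Diet,
               data = ChickWeight,
               sizes = 1:3,
               metric = "MAPE",      # used by rfe() to choose the subset size
               maximize = FALSE,     # lower MAPE is better
               rfeControl = rfec,
               # the remaining arguments are forwarded to train()
               method = "rf",
               ntree = 100,
               tuneGrid = tunegrid,
               trControl = trc)

results$results        # MAPE per subset size
results$fit$bestTune   # mtry chosen by train() for the final model
```

For subset sizes smaller than the largest mtry in the grid, randomForest may warn that mtry was reset to a valid range; trimming the grid per subset size would need a custom fit function.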