mlr3

mlr3 RFE Termination Metric


This may be a naive question, but I would like to use recursive feature elimination (RFE) with a random forest model. Is there a way to terminate based on the feature set that gives the smallest RMSE (like this figure from caret)?

From the documentation, it seems that RFE defaults to terminating once half of the features have been eliminated; is that correct?

Thanks for your help @be-marc, and my apologies for my naivety, as this is all new to me. I tried to implement your suggestion with the code I was already running (see below), but I was not sure where to find the archive, since I was not using `fselect()` but rather `AutoFSelector` with nested resampling:


library(mlr3verse)  # loads mlr3, mlr3learners, mlr3fselect

ARMSS <- read.csv("Index ARMSS Proteomics Final.csv", row.names = 1)

set.seed(123, "L'Ecuyer")

# regression task predicting Index.ARMSS
task = as_task_regr(ARMSS, target = "Index.ARMSS")

# random forest with impurity-based variable importance (required by RFE)
learner = lrn("regr.ranger", importance = "impurity")
set_threads(learner, n = 8)

resampling_inner = rsmp("cv", folds = 7)
measure = msr("regr.rmse")
terminator = trm("none")

at = AutoFSelector$new(
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  store_models = TRUE)

resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)

rr = resample(task, at, resampling_outer, store_models = TRUE)

Should I use the extract_inner_fselect_archives() command and then, for each iteration, identify the smallest RMSE and the features selected? And how do I reconcile differences across iterations in the number of features and/or the features selected?


Solution

  • if I could terminate based on the feature set that gives the smallest RMSE

    That is not possible: RFE cannot know in advance which feature set will give the smallest RMSE. Instead, terminate when one feature is left and then look at the archive to find the feature set with the lowest RMSE. With feature_fraction = 0.5 and n_features = 1 you get the same run as caret.

    library(mlr3verse)

    instance = fselect(
      fselector = fs("rfe", n_features = 1, feature_fraction = 0.5),
      task = tsk("mtcars"),
      learner = lrn("regr.rpart"),
      resampling = rsmp("holdout"),
      measures = msr("regr.rmse"),
      store_models = TRUE
    )
    
    # all evaluated feature sets with their RMSE
    as.data.table(instance$archive)
    
    # feature set with the lowest RMSE
    instance$archive$best()
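
    For the nested-resampling follow-up, here is a sketch of how the inner archives could be inspected. It assumes the `rr` object from the question's code; the archive's measure column (`regr.rmse` here) and the `features` list column depend on your setup, so check `names(archives)` first:

    ```r
    library(data.table)
    library(mlr3fselect)

    # One row per evaluated feature set, with an `iteration` column
    # identifying the outer-resampling iteration it came from
    archives = extract_inner_fselect_archives(rr)

    # Lowest inner-loop RMSE within each outer iteration
    best = archives[, .SD[which.min(regr.rmse)], by = iteration]

    # The winning feature sets will generally differ between iterations;
    # one common way to reconcile them is to count how often each feature
    # appears among the per-iteration winners (selection frequency)
    freq = sort(table(unlist(best$features)), decreasing = TRUE)
    ```

    Note that with nested resampling there is no single "final" feature set; the outer loop estimates the performance of the whole selection procedure, and the selection frequency above is only a descriptive summary of how stable the selection is.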