
RFE Termination Using RMSE with AutoFSelector


To mimic how caret performs RFE and select features that produce the lowest RMSE, it was suggested to use the archive.

I am using AutoFSelector and nested resampling with the following code:


library(mlr3verse)  # attaches mlr3, mlr3fselect and mlr3learners

ARMSS <- read.csv("Index ARMSS Proteomics Final.csv", row.names = 1)

set.seed(123, "L'Ecuyer")

task = as_task_regr(ARMSS, target = "Index.ARMSS")

learner = lrn("regr.ranger", importance = "impurity")

set_threads(learner, n = 8)

resampling_inner = rsmp("cv", folds = 7)
measure = msr("regr.rmse")
terminator = trm("none")

at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)

resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)

rr = resample(task, at, resampling_outer, store_models = TRUE)

Should I use extract_inner_fselect_archives() to find, for each outer iteration, the feature subset with the smallest RMSE, and then rerun the code above with the n_features argument changed? How do I reconcile differences across iterations in the number of features and/or the features selected?


Solution

  • Nested resampling is a statistical procedure to estimate the predictive performance of the model trained on the full dataset, it is not a procedure to select optimal hyperparameters. Nested resampling produces many hyperparameter configurations which should not be used to construct a final model.

    mlr3book Chapter 4 - Optimization.

    The same is true for feature selection. You don't select a feature set with nested resampling. You estimate the performance of the final model.

    "it was suggested to use the archive"

    Without nested resampling, you just call instance$result (or at$fselect_result on a trained AutoFSelector) to get the feature subset with the lowest RMSE.
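
A minimal sketch of the two separate steps, under a few assumptions: the built-in mtcars task stands in for the ARMSS data (which is not available here), the fold counts are reduced to keep the run short, and a recent mlr3verse installation with ranger is present. It is meant as an illustration, not as the definitive workflow.

```r
library(mlr3verse)  # attaches mlr3, mlr3fselect and mlr3learners

set.seed(123)

# Assumption: mtcars replaces the proteomics CSV from the question
task = tsk("mtcars")

learner = lrn("regr.ranger", importance = "impurity")

at = AutoFSelector$new(
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5),
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  terminator = trm("none"),
  store_models = TRUE)

# Step 1: nested resampling, used only to estimate performance
rr = resample(task, at, rsmp("cv", folds = 3), store_models = TRUE)
rmse_estimate = rr$aggregate(msr("regr.rmse"))

# Optional: look at how much the selected subsets vary across outer folds.
# This is for inspection only, not for choosing the final feature set.
inner_results = extract_inner_fselect_results(rr)

# Step 2: run RFE once on the complete task to get the final feature set
at$train(task)
best = at$fselect_result                 # subset with the lowest inner-CV RMSE
selected_features = best$features[[1]]   # character vector of feature names
archive = as.data.table(at$archive)      # every subset RFE evaluated, with its RMSE
```

Here rmse_estimate is the number to report for the whole procedure, while selected_features comes from the single run on all the data. The subsets in inner_results will usually differ between outer folds; that is expected and does not need to be reconciled.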