
mlr: Unexpected result when comparing output from getFilteredFeatures and generateFilterValuesData


I am using two different methods to get the features selected by a filter. I expected both methods to return the same values, but they do not, and I do not understand why. The reason for using the second method is that I want access to the scores that were used to select the features, not just the names of the selected features.

The filter is a univariate model score filter that uses a Cox model to measure performance and selects the top 5 features. I create a resample instance so that both methods use the same samples in each fold.

The first method is the usual one - using makeFilterWrapper to wrap the filter around a Lasso model, and calling getFilteredFeatures via the extract argument of resample. My understanding is that getFilteredFeatures returns the features that were selected by the filter, before being passed to the Lasso model.

In the second method, I use subsetTask to create the same sub-task that getFilteredFeatures would be using in each CV fold, and then I call generateFilterValuesData to get the scores the filter generated. The top 5 features on this list in each fold should match the features returned by getFilteredFeatures for that fold, but they do not. Why is this?

library(survival)
#> Warning: package 'survival' was built under R version 3.5.3
library(mlr)
#> Loading required package: ParamHelpers

data(veteran)
task_id = "VET"
vet.task <- makeSurvTask(id = task_id, data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)

inner = makeResampleDesc("CV", iters=2, stratify=TRUE)  # Tuning
outer = makeResampleDesc("CV", iters=2, stratify=TRUE)  # Benchmarking

set.seed(24601)
resinst = makeResampleInstance(desc=outer, task=vet.task)

cox.lrn <- makeLearner(cl="surv.coxph", id = "coxph", predict.type="response")
lasso.lrn  <- makeLearner(cl="surv.cvglmnet", id = "lasso", predict.type="response", alpha = 1, nfolds=5)

filt.uni.lrn = 
  makeFilterWrapper(
    lasso.lrn, 
    fw.method="univariate.model.score", 
    perf.learner=cox.lrn,
    fw.abs = 5
  )
#Method 1
res = resample(learner=filt.uni.lrn, task=vet.task, resampling=resinst, measures=list(cindex), extract=getFilteredFeatures)
#> Resampling: cross-validation
#> Measures:             cindex
#> [Resample] iter 1:    0.7458904
#> [Resample] iter 2:    0.6575813
#> 
#> Aggregated Result: cindex.test.mean=0.7017359
#> 
res$extract
#> [[1]]
#> [1] "karno"              "diagtime"           "celltype.squamous" 
#> [4] "celltype.smallcell" "celltype.adeno"    
#> 
#> [[2]]
#> [1] "karno"              "diagtime"           "age"               
#> [4] "celltype.smallcell" "celltype.large"

#Method 2
for (i in 1:2) {
  subt = subsetTask(task=vet.task, subset = resinst$train.inds[[i]])
  print(generateFilterValuesData(subt, method="univariate.model.score", perf.learner=cox.lrn))
}
#> FilterValues:
#> Task: VET
#>                 name    type                 method     value
#> 2              karno numeric univariate.model.score 0.6387665
#> 7 celltype.smallcell numeric univariate.model.score 0.6219512
#> 8     celltype.adeno numeric univariate.model.score 0.5700000
#> 5              prior numeric univariate.model.score 0.5456522
#> 6  celltype.squamous numeric univariate.model.score 0.5316206
#> 4                age numeric univariate.model.score 0.5104603
#> 1                trt numeric univariate.model.score 0.5063830
#> 3           diagtime numeric univariate.model.score 0.4760956
#> 9     celltype.large numeric univariate.model.score 0.3766520
#> FilterValues:
#> Task: VET
#>                 name    type                 method     value
#> 2              karno numeric univariate.model.score 0.6931330
#> 9     celltype.large numeric univariate.model.score 0.6264822
#> 7 celltype.smallcell numeric univariate.model.score 0.5269058
#> 6  celltype.squamous numeric univariate.model.score 0.5081967
#> 8     celltype.adeno numeric univariate.model.score 0.5064655
#> 4                age numeric univariate.model.score 0.4980237
#> 1                trt numeric univariate.model.score 0.4646018
#> 3           diagtime numeric univariate.model.score 0.4547619
#> 5              prior numeric univariate.model.score 0.4527897

Created on 2019-10-02 by the reprex package (v0.3.0)


Solution

  • You are mixing up two things here.

    Case 1 (Nested resampling)

    The features selected in the outer fold of the nested resampling are determined from the best performing fold of the inner resampling.

    1. Fold 1 (inner) -> calculate top 5 features with filter -> calculate model performance
    2. Fold 2 (inner) -> calculate top 5 features with filter -> calculate model performance
    3. Check which inner fold had the best performance (let's assume fold 1) -> take the top 5 features from this fold for model fitting in the outer fold

    Hence, filter values are not calculated on the outer fold at all, only on the inner ones. You are essentially asking "give me the top 5 features according to the filter from the inner loop and train the model on only these in the outer fold". Because the filter values are not recalculated in the outer fold, you only get the feature names back, not the scores.
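    For illustration, here is a rough sketch (not from the original answer) of how such an explicit nested setup could be written in mlr, assuming the filter size fw.abs is tuned in the inner folds; the parameter set, the grid control and extract = getTuneResult are my additions:

    ps = makeParamSet(makeDiscreteParam("fw.abs", values = c(3, 5, 7)))
    ctrl = makeTuneControlGrid()
    # inner CV chooses fw.abs (and with it the filtered feature set)
    tuned.lrn = makeTuneWrapper(filt.uni.lrn, resampling = inner, par.set = ps,
                                control = ctrl, measures = list(cindex), show.info = FALSE)
    # the outer CV only refits with the settings chosen in the inner loop
    res.nested = resample(tuned.lrn, vet.task, resampling = resinst,
                          measures = list(cindex), extract = getTuneResult)
    res.nested$extract  # what the inner loop selected for each outer fold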

    Case 2 (Direct calculation of filter values)

    Here, you generate the filter values directly on the two outer folds. Since the observations differ from those of the inner folds in the nested resampling (case 1), the filter, which scores each feature by fitting the Cox perf.learner on it, comes up with different scores (the model fitting happens on different observations) and possibly a different ranking.


    If I understand correctly, your assumption is that the filter values are generated again for each outer fold in the nested resampling setting. This is not the case, and it would also have no benefit, because the features to fit the model with have already been chosen during the optimization in the inner folds.

    For the outer folds, the model is only trained with the selected features suggested by the inner loop. The same logic applies to tuning: "give me the best hyperparameters across all folds from the inner loop (I'll tell you how to do so) and then fit a model on the outer fold using these settings".

    Maybe it helps to transfer this logic to tuning: you would not call tuneParams() standalone on each outer fold and expect it to return the same hyperparameters that the inner optimization of a nested resampling came up with, would you?
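    To make that concrete, here is a small sketch (again my addition, reusing the hypothetical ps and ctrl objects from the sketch above) of such a standalone call on the first outer training set; its result need not agree with what the nested optimization selected:

    outer.train = subsetTask(vet.task, subset = resinst$train.inds[[1]])
    standalone = tuneParams(filt.uni.lrn, task = outer.train, resampling = inner,
                            par.set = ps, control = ctrl, measures = list(cindex))
    standalone$x  # the fw.abs chosen here may differ from res.nested$extract[[1]]$x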