Tags: r, machine-learning, feature-selection, mlr

Feature importance of learner used in benchmark experiment - mlr


I am using the mlr package in R to compare two learners, a random forest and a lasso classifier, on a binary classification task. I used nested cross-validation to estimate performance. I would now like to compute feature importance for the better classifier, which is the random forest in this case. To do this I am using generateFeatureImportanceData(), which is documented as follows: "Estimate how important individual features or groups of features are by contrasting prediction performances. For method “permutation.importance” compute the change in performance from permuting the values of a feature (or a group of features) and compare that to the predictions made on the unpermuted data." Since I specified measure = auc, does the output res give the decrease in AUC for each feature when its values are permuted?

library(easypackages)

libraries("mlr","purrr","glmnet","parallelMap","parallel")

data = read.table("data_past.txt", header = TRUE)

set.seed(123)

task = makeClassifTask(id = "past_history", data = data, target = "DIAG", positive = "BD")

# hyperparameter search space for the random forest
ps_rf = makeParamSet(makeIntegerParam("mtry", lower = 4, upper = 16), makeDiscreteParam("ntree", values = 1000))

ctrl_rf = makeTuneControlRandom(maxit = 10L)

# inner resampling used for tuning; the documented argument name is folds
inner = makeResampleDesc("RepCV", folds = 10, reps = 3, stratify = TRUE)

lrn_rf = makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = TRUE)

lrn_rf = makeTuneWrapper(lrn_rf, resampling = inner, par.set = ps_rf, control = ctrl_rf, measures = auc, show.info = FALSE)

parallelStartMulticore(36)

ft_im = generateFeatureImportanceData(task = task, method = "permutation.importance", learner = lrn_rf, measure = auc)

parallelStop()

t(ft_im$res)
                                auc
INC2_A                 0.000000e+00
INC2_B                 0.000000e+00
INC2_F                 0.000000e+00
INC2_G                 0.000000e+00
INC2_H                 0.000000e+00
INC2_I                 0.000000e+00
SEX                    0.000000e+00
marital               -3.211696e-07
inpatient              0.000000e+00
CMS_1                  0.000000e+00
CMS_2                  0.000000e+00
CMS_3                  0.000000e+00
CMS_4                  0.000000e+00
CMS_5                  0.000000e+00
CMS_6                  0.000000e+00
CMS_7                  0.000000e+00
CMS_8                  0.000000e+00
CMS_9                  0.000000e+00
CMS_10                 0.000000e+00
CMS_11                 0.000000e+00
CMS_12                 0.000000e+00
CMS_13                 0.000000e+00
CMS_14                 0.000000e+00
OCS_1                  0.000000e+00
OCS_2                  0.000000e+00
OCS_3                  0.000000e+00
OCS_4                  0.000000e+00
OCS_5                  0.000000e+00
OCS_6                  0.000000e+00
OCS_7                  0.000000e+00
OCS_8                  0.000000e+00
OCS_9                  0.000000e+00
OCS_10                 0.000000e+00
OCS_11                 0.000000e+00
reta                   0.000000e+00
MH_F1                 -1.051220e-03
CP_1BA                 0.000000e+00
CP_1BS                 0.000000e+00
MIXCLINICAL3           0.000000e+00
MIXCLINICAL2           0.000000e+00
MIXDS52Simpt           0.000000e+00
MIXDS53Simpt           0.000000e+00
PAN                    0.000000e+00
OBS                    0.000000e+00
PHO                    0.000000e+00
GAD                    0.000000e+00
EAT_0                  0.000000e+00
ADHD                   0.000000e+00
BORDERLINEPERSONALITY  0.000000e+00
AlcoolProbUse          0.000000e+00
SubstanceProbUse       0.000000e+00
BMI                   -2.954760e-06
DEP_AGE               -7.996641e-04
NBD_P                 -1.669455e-03
NBDEP                 -8.671578e-06
NBSUI                 -2.055485e-06
NBHOS                 -8.091225e-03
DURDEP                -1.380869e-04
SEV_M                 -3.083132e-03
SEV_D                  0.000000e+00
CMS_sum                0.000000e+00
TOTMIXDSM5             0.000000e+00
GAF                   -1.170663e-05
Age                   -1.172269e-06
Comorbidities_sum      0.000000e+00

Are the features with the highest absolute values the most important ones? Does a value of zero for auc mean that the feature is irrelevant for the classification task at hand? Thank you.


Solution

  • The score of a feature is obtained by subtracting the model's normal (unpermuted) prediction score from the prediction score obtained after permuting that feature. With measure = auc, a negative value therefore means that permuting the feature lowered the AUC by that amount.

    Features with an AUC change of 0 are irrelevant in the sense that they add no value to this model: it performs just as well when their values are replaced by random noise. Conversely, the features with the largest absolute values are the most important, because permuting them changes the score the most.
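
    To make the sign and size of these scores concrete, here is a minimal base-R sketch of the same idea. It is not mlr's implementation: the data are simulated, a logistic regression stands in for the random forest, the scores are computed on the same data the model was fitted on, and the names x1, x2, auc_score and perm_importance are hypothetical helpers introduced only for illustration.

```r
set.seed(1)

# Simulated data (an assumption for illustration): x1 drives the outcome,
# x2 is pure noise.
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(2 * x1))
d  <- data.frame(y = y, x1 = x1, x2 = x2)

# Rank-based AUC (Mann-Whitney formulation); hypothetical helper.
auc_score <- function(prob, truth) {
  r  <- rank(prob)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit once and record the unpermuted (baseline) AUC.
fit      <- glm(y ~ x1 + x2, data = d, family = binomial)
baseline <- auc_score(predict(fit, d, type = "response"), d$y)

# Shuffle one column, predict again with the same fitted model,
# and return permuted AUC minus baseline AUC.
perm_importance <- function(feature) {
  shuffled <- d
  shuffled[[feature]] <- sample(shuffled[[feature]])
  auc_score(predict(fit, shuffled, type = "response"), d$y) - baseline
}

imp <- sapply(c("x1", "x2"), perm_importance)
print(imp)
```

    In this sketch the informative feature x1 should get a clearly negative score, while the noise feature x2 should stay close to zero, which mirrors how the table in the question is read.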