Tags: r, machine-learning, pipeline, xgboost, mlr3

Importance based variable reduction


I am having difficulty filtering out the least important variables in my model. I received a data set with more than 4,000 variables, and I have been asked to reduce the number of variables that go into the model.

I have already tried two approaches, and both failed.

The first thing I tried was to check variable importance manually after modelling and, based on that, remove the unimportant variables.

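# reproducible example
library(dplyr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3tuning)

# artificial class imbalance: "virginica" vs. the rest
data <- iris %>%
  mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))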

Everything works fine with a simple Learner:

# creating Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")

# creating Learner
lrn <- lrn("classif.xgboost") 

# setting scoring as prediction type 
lrn$predict_type = "prob"

lrn$train(task)
lrn$importance()

 Petal.Width Petal.Length 
  0.90606304   0.09393696 

The issue is that the data is highly imbalanced, so I decided to use a GraphLearner with a PipeOp that undersamples the majority class, and then pass that GraphLearner to an AutoTuner.

I have skipped the parts of the code that I believe are not important here, such as the search space, terminator and tuner.

# undersampling
po_under <- po("classbalancing",
               id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)

# combine learner with pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)

# setting the autoTuner
at <- AutoTuner$new(
  learner = lrn_under,
  resampling = resample,
  measure = measure,
  search_space = ps_under,
  terminator = terminator,
  tuner = tuner
)

at$train(task)

The problem right now is that, although the importance property is still listed for at, the $importance() method is unavailable.

> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights

So I decided to change my approach and add filtering to the Learner instead. That went even worse. I started with this chapter of the mlr3book: https://mlr3book.mlr-org.com/fs.html. I tried to add importance = "impurity" to the Learner, just like in the book, but it produced an error.

> lrn <- lrn("classif.xgboost", importance = "impurity") 
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
  nie można zmienić wartości zablokowanego połączenia dla 'importance'

In English, this reads:

Error in instance[[nn]] <- dots[[i]] : cannot change value of locked binding for 'importance'

I also tried a workaround using PipeOp filtering, but that failed as well. I believe I won't be able to do it without importance = "impurity".

So my question is, is there a way to achieve what I am aiming for?

In addition, I would be very grateful if someone could explain why filtering by importance is possible before modelling. Shouldn't it be based on the model result?


Solution

  • The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.

    The trained GraphLearner is saved inside your AutoTuner under $learner:

    # get the trained GraphLearner, with tuned hyperparameters
    graphlearner <- at$learner
    

    This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner and then it wouldn't even know which importance to give!).

    Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:

    1. Get the untrained Learner object.
    2. Get the trained state of the Learner and put it into that object.

    # get the untrained Learner
    xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
    
    # put the trained model into the Learner
    xgboostlearner$state <- graphlearner$model$classif.xgboost
    

    Now the importance can be queried:

    xgboostlearner$importance()
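
The two steps above can be tried end to end with a minimal sketch like the following. It leaves out the AutoTuner, since its search space, terminator and tuner are not shown in the question, and trains the GraphLearner directly; with an AutoTuner, graphlearner would be at$learner instead. The value nrounds = 10 is an arbitrary choice for illustration, not something taken from the question.

```r
library(mlr3)
library(mlr3learners)   # provides classif.xgboost (requires the xgboost package)
library(mlr3pipelines)

# binary version of iris, as in the question
data <- iris
data$Species <- factor(ifelse(data$Species == "virginica", "1", "0"))
task <- TaskClassif$new(id = "score", backend = data,
                        target = "Species", positive = "1")

# undersampling followed by xgboost; nrounds = 10 is an assumed, illustrative value
po_under <- po("classbalancing", id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)
graphlearner <- GraphLearner$new(po_under %>>% lrn("classif.xgboost", nrounds = 10))
graphlearner$train(task)

# step 1: get the untrained Learner stored inside the graph
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner

# step 2: put the trained state into that Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost

# named numeric vector of importance scores
imp <- xgboostlearner$importance()
print(imp)
```

The names of imp are the features the fitted model actually used, so features that xgboost never split on may be absent from the result.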
    

    The example from the book that you link to does not work in your case because the book uses the ranger Learner, while you are using xgboost. importance = "impurity" is a parameter specific to ranger.
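
Because the xgboost Learner reports importance without any extra argument, it can probably be plugged into an importance filter as it is. The sketch below is one possible setup, assuming the mlr3filters package is installed: the filter fits its own xgboost model on the training data to score the features, and only the top-scoring features are passed on to the final Learner. That is also why filtering "before" modelling is possible: the filter does its own model fit first. The values filter.nfeat = 2 and nrounds = 10 are arbitrary and chosen only for illustration.

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3filters)

# binary version of iris, as in the question
data <- iris
data$Species <- factor(ifelse(data$Species == "virginica", "1", "0"))
task <- TaskClassif$new(id = "score", backend = data,
                        target = "Species", positive = "1")

# importance filter: trains its own xgboost model to score the features
filter <- flt("importance", learner = lrn("classif.xgboost", nrounds = 10))

# keep only the 2 highest-scoring features (illustrative value)
po_filter <- po("filter", filter = filter, filter.nfeat = 2)

# the final learner only sees the features that passed the filter
final_learner <- lrn("classif.xgboost", nrounds = 10, predict_type = "prob")
graphlearner <- GraphLearner$new(po_filter %>>% final_learner)
graphlearner$train(task)

# features that survived the filter step ("importance" is the PipeOp's default id)
kept <- graphlearner$model$importance$features
print(kept)
```

The same filter PipeOp could be chained after the undersampling step, so that the importance scores are computed on the balanced data.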