Tags: feature-selection, auc, mlr3

AUC filter in mlr3filters


I am trying to understand more about the AUC filter of mlr3filters. I use a classification task (task) and the following code:

filter = flt("auc")
filter$calculate(task)
result <- as.data.table(filter)

From the documentation of mlr3measures::auc(), I understand that I need a vector of probabilities and a vector of (binary) factor values, as well as the "true" class. In my task, I have a binary class (as "target") and many numeric features which are not between 0 and 1, so I cannot interpret them as probabilities. Why is the AUC calculated then? Or is there an additional assumption? My problem is that I cannot read this from filter$help().

As a general question: Is there an additional "explanation layer" between the function references in https://mlr3filters.mlr-org.com/reference/index.html and the underlying R functions? For example, I understand that FilterVariance$new() generates a filter object that calculates the variances of the single features by only using these features and applying stats::var(). But from the book, I also see that I can specify cutoff values:

po("filter", mlr3filters::FilterVariance$new(), filter.frac = 0.5)

Where do I find details about this filter.frac value? I cannot find it in filter$help(), and it is not in stats::var() either.

Similarly, I understand that FilterCorrelation$new() generates a filter object that takes the single features and the target to calculate the feature ranks. This may be self-explanatory, but I wonder where I could find more details about such issues.

I tried the answer that I found here (Filtering in mlr3filters - where can I find details about the methods?), but I could not find details in filter$help().

Thanks in advance for beginners$help()


Solution

  • While the documentation of the AUC filter states that it is analogous to mlr3measures::auc, it does not actually use that function. Instead, it uses its own implementation, which calculates the Area Under the Receiver Operating Characteristic (ROC) Curve. This does not require a probability value, only a continuous value that can be used to divide samples with a cutoff.
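    To see why only the ordering matters, here is a minimal sketch in base R (not mlr3's actual code, and `rank_auc` is a name made up for illustration) of computing the AUC from ranks via the Mann-Whitney U statistic:

    ```r
    # Minimal sketch (not mlr3's implementation): AUC from ranks via the
    # Mann-Whitney U statistic. Only the ordering of x matters, so x can
    # be any numeric feature -- it need not be a probability in [0, 1].
    rank_auc = function(x, positive) {
      r = rank(x)                       # average ranks handle ties
      n_pos = sum(positive)
      n_neg = sum(!positive)
      u = sum(r[positive]) - n_pos * (n_pos + 1) / 2
      u / (n_pos * n_neg)
    }

    x = c(10, 250, 3, 80, 999)          # arbitrary scale, clearly not probabilities
    y = c(FALSE, TRUE, FALSE, TRUE, FALSE)
    rank_auc(x, y)
    rank_auc(x / 1000, y)               # monotone rescaling gives the same AUC
    ```

    If I remember correctly, the filter then reports the score as abs(auc - 0.5), so that features separating the classes in either direction rank highly, but check the package source if you need to be sure.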

    The Filters, as I understand them, are primarily used to calculate filter scores (see the documentation of the Filter base class). They are mostly useful in combination with the PipeOpFilter from the mlr3pipelines package, which is also what the example from the book that you cite is doing. PipeOps are generally useful for integrating a filtering step (or any other kind of preprocessing) into a learning process (the book chapter on mlr3pipelines is probably a good place to learn about that). However, a PipeOp can also be applied on its own, e.g. to filter columns out of a Task:

    pof = po("filter", mlr3filters::FilterVariance$new(), filter.frac = 0.5)
    task = tsk("iris")
    pof_result = pof$train(list(task))
    pof_result[[1]]
    #> <TaskClassif:iris> (150 x 3)
    #> * Target: Species
    #> * Properties: multiclass
    #> * Features (2):
    #>   - dbl (2): Petal.Length, Sepal.Length
    pof_result[[1]]$data()
    #>        Species Petal.Length Sepal.Length
    #>   1:    setosa          1.4          5.1
    #>   2:    setosa          1.4          4.9
    #>   3:    setosa          1.3          4.7
    #>   4:    setosa          1.5          4.6
    #>   5:    setosa          1.4          5.0
    #>  ---                                    
    #> 146: virginica          5.2          6.7
    #> 147: virginica          5.0          6.3
    #> 148: virginica          5.2          6.5
    #> 149: virginica          5.4          6.2
    #> 150: virginica          5.1          5.9
    

    See ?PipeOp for details about the $train() method (although the book chapter is really a better place to start, imho). ?PipeOpFilter documents the filter.frac hyperparameter:

    Parameters:
    
    [...]
    
       • ‘filter.nfeat’ :: ‘numeric(1)’
         Number of features to select. Mutually exclusive with ‘frac’
         and ‘cutoff’.
    
       • ‘filter.frac’ :: ‘numeric(1)’
         Fraction of features to keep. Mutually exclusive with ‘nfeat’
         and ‘cutoff’.
    
       • ‘filter.cutoff’ :: ‘numeric(1)’
         Minimum value of filter heuristic for which to keep features.
         Mutually exclusive with ‘nfeat’ and ‘frac’.
    
    Note that at least one of ‘filter.nfeat’, ‘filter.frac’, or
    ‘filter.cutoff’ must be given.
    

    So what PipeOpFilter does is run the Filter's $calculate() method and then select features based on the resulting $scores.
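    You can mimic this by hand to see what happens under the hood (a rough sketch using the built-in iris task; the 0.5 fraction mirrors the filter.frac = 0.5 above, and is not an actual parameter of the Filter itself):

    ```r
    library(mlr3)
    library(mlr3filters)

    task = tsk("iris")
    filter = flt("variance")
    filter$calculate(task)
    filter$scores                      # named numeric vector of per-feature scores

    # keep the top 50% of features, roughly what filter.frac = 0.5 does
    n_keep = ceiling(length(filter$scores) * 0.5)
    keep = names(head(sort(filter$scores, decreasing = TRUE), n_keep))
    task$select(keep)
    task$feature_names
    ```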

    To answer the more general question of where to get help: the object classes (which you can find out by calling class(object)) are usually a good place to start looking. These classes often inherit from more general base classes that explain other parts of the process, if they are not familiar. In this example, PipeOpFilter inherits from several classes, among them PipeOp; this is stated in the help file. Besides that, there is the book, which you already know about. Finally, if everything else fails, it may unfortunately become necessary to look at the source code.
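    For example, inspecting the class hierarchy of the PipeOp from above points you to all the relevant help pages:

    ```r
    library(mlr3pipelines)
    library(mlr3filters)

    pof = po("filter", flt("variance"), filter.frac = 0.5)
    # each entry in the inheritance chain has its own help page,
    # e.g. ?PipeOpFilter, ?PipeOp
    class(pof)
    ```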