Tags: r, machine-learning, statistics, fselector

Use of formula in information.gain in R


In the function definition for the FSelector information.gain function,

information.gain(formula, data)

what exactly is the purpose of the formula? I'm trying to use the function to do feature selection for a classification task. In the few examples that I've seen online, it seems like the formula defines some kind of relationship between the class label and the features in the dataset. However, if this is the case, I don't know the exact linear relationship between the features and the labels since I'm performing a classification task, so what would the formula be?


Solution

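  • The formula does not describe a linear (or any other functional) relationship. It only tells the function which column is the class label (left of the ~) and which columns are the candidate attributes (right of the ~). You can use a dot (.) on the right-hand side to say that you want to analyse the dependency between the class variable and all other variables in the data frame. For example, for the iris dataset: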

    > library(FSelector)
    > information.gain(Species~., iris)
                    attr_importance
    Sepal.Length       0.4521286
    Sepal.Width        0.2672750
    Petal.Length       0.9402853
    Petal.Width        0.9554360
    

    If you want to analyse the dependency of the class on only a subset of the variables, name them explicitly on the right-hand side, separated by +:

    > information.gain(Species~Sepal.Length+Sepal.Width, iris)
                    attr_importance
    Sepal.Length       0.4521286
    Sepal.Width        0.2672750
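
To make the role of the formula concrete, here is a minimal base-R sketch of how a formula can be unpacked into a class column and attribute columns, with an information gain computed for each attribute as H(class) + H(attribute) - H(class, attribute). The helper names entropy and info_gain_sketch, the bins parameter, and the equal-width binning via cut() are assumptions made for illustration. They are not FSelector's code, and FSelector discretizes numeric attributes differently, so the numbers will not match the output above.

```r
# Hypothetical helpers for illustration only; not part of FSelector.

# Shannon entropy (natural log) of a categorical vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]                      # drop empty levels so log(0) never occurs
  -sum(p * log(p))
}

info_gain_sketch <- function(formula, data, bins = 5) {
  # model.frame() resolves the formula against the data frame:
  # column 1 is the left-hand side, the rest are the right-hand side
  # (a `.` expands to every other column).
  mf <- model.frame(formula, data)
  y <- mf[[1]]
  sapply(mf[-1], function(x) {
    # Assumption: equal-width binning for numeric attributes.
    if (is.numeric(x)) x <- cut(x, breaks = bins)
    # H(class) + H(attribute) - H(class, attribute)
    entropy(y) + entropy(x) - entropy(interaction(y, x, drop = TRUE))
  })
}

# Same two formulas as in the answer above.
info_gain_sketch(Species ~ ., iris)
info_gain_sketch(Species ~ Sepal.Length + Sepal.Width, iris)
```

Nothing in this sketch fits a model to the data: the formula is used purely to select columns, which is why you do not need to know how the features relate to the label before calling the function.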