python scikit-learn statistics feature-selection

How does sklearn.feature_selection.SelectFwe work?


import numpy as np
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn import __version__ #1.0.2

I have this example dataset:

X = np.array([
    [3,0,2,1],
    [7,3,0,5],
    [4,2,5,1],
    [6,2,7,3],
    [3,2,5,2],
    [6,1,1,4]
])
y = np.array([2,9,2,4,5,9])

The f_regression(X, y) function returns two arrays:

(array([ 4.68362124,  0.69456469,  2.59714175, 27.64721141]),
 array([0.09643779, 0.45148854, 0.18234859, 0.00626275]))

The first one contains the F-statistics for the 4 features of my dataset; the second one contains the p-values associated with those F-statistics.

Now suppose I want to extract the features with a p-value lower than 0.15; I expect the first and last features to be selected. I would like to use SelectFwe (see the documentation) to perform this step, so:

SelectFwe(f_regression, alpha=.15).fit(X, y).get_support()

Unfortunately it returns array([False, False, False, True]), meaning that only the last feature is selected.

Why does this happen? Did I misunderstand how SelectFwe works? The following plot may help:

[Plot: number of selected features as a function of alpha]

The code I used to produce the plot:

import matplotlib.pyplot as plt

# Count the selected features for each alpha in [0, 1].
alphas = np.linspace(0, 1, 101)
plt.plot(
    alphas,
    [SelectFwe(f_regression, alpha=alpha).fit(X, y).get_support().sum() for alpha in alphas]
)

plt.xlabel("alpha")
plt.ylabel("selected features")
plt.show()

Solution

  • In the source, alpha is divided by the number of features before it is compared with the p-values:

        def _get_support_mask(self):
            check_is_fitted(self)
    
            return self.pvalues_ < self.alpha / len(self.pvalues_)
    

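    This is because the class controls the "family-wise error" rate, which is "the probability of making one or more false discoveries" (Wikipedia, my emphasis). Dividing alpha by the number of features is the Bonferroni correction. In your example the effective threshold is 0.15 / 4 = 0.0375, so only the last feature (p ≈ 0.0063) falls below it. You can use SelectFpr instead, the "false positive rate" test, which works exactly the same but doesn't divide by the number of features. See also Issue1007.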
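
As a rough check of the explanation above, the following sketch compares both selectors with the thresholds applied by hand to the p-values from the question. The variable names (`fwe_mask`, `fpr_mask`, `manual_fwe_mask`, `manual_fpr_mask`) are illustrative only, and the manual comparisons assume the `_get_support_mask` logic quoted above.

```python
import numpy as np
from sklearn.feature_selection import SelectFpr, SelectFwe, f_regression

# Example dataset from the question.
X = np.array([
    [3, 0, 2, 1],
    [7, 3, 0, 5],
    [4, 2, 5, 1],
    [6, 2, 7, 3],
    [3, 2, 5, 2],
    [6, 1, 1, 4],
])
y = np.array([2, 9, 2, 4, 5, 9])

alpha = 0.15
n_features = X.shape[1]

# f_regression returns (F-statistics, p-values); only the p-values are needed here.
_, pvalues = f_regression(X, y)

# SelectFwe: p-values are compared with alpha / n_features (0.0375 here).
fwe_mask = SelectFwe(f_regression, alpha=alpha).fit(X, y).get_support()
manual_fwe_mask = pvalues < alpha / n_features

# SelectFpr: p-values are compared with alpha itself (0.15 here).
fpr_mask = SelectFpr(f_regression, alpha=alpha).fit(X, y).get_support()
manual_fpr_mask = pvalues < alpha

print("SelectFwe:", fwe_mask)  # only the last feature
print("SelectFpr:", fpr_mask)  # first and last features
```

With this data, SelectFpr with alpha=0.15 should give the selection expected in the question, while SelectFwe would need alpha=0.6 (that is, 0.15 times the 4 features) to reach the same per-feature threshold.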