import numpy as np
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn import __version__  # 1.0.2
I have this example dataset:
X = np.array([
    [3, 0, 2, 1],
    [7, 3, 0, 5],
    [4, 2, 5, 1],
    [6, 2, 7, 3],
    [3, 2, 5, 2],
    [6, 1, 1, 4]
])
y = np.array([2,9,2,4,5,9])
The f_regression(X, y) function returns two arrays:
(array([ 4.68362124, 0.69456469, 2.59714175, 27.64721141]),
array([0.09643779, 0.45148854, 0.18234859, 0.00626275]))
The first one contains the F-statistics for the 4 features of my dataset, and the second one contains the p-values associated with those F-statistics.
Now suppose I want to extract the features with a p-value lower than 0.15; I expect the first and last features to be selected. I would like to use SelectFwe (documentation here) to perform this step, so:
SelectFwe(f_regression, alpha=.15).fit(X, y).get_support()
Unfortunately it returns array([False, False, False, True]), meaning that only the last feature is selected.
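For reference, thresholding the p-values by hand gives the selection I expect:

F_vals, p_vals = f_regression(X, y)
p_vals < 0.15  # array([ True, False, False,  True]) -> first and last features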
Why does this happen? Did I misunderstand how SelectFwe works? Perhaps the following plot is helpful:
The code I used to produce the plot:
import matplotlib.pyplot as plt

plt.plot(
    np.linspace(0, 1, 101),
    [SelectFwe(f_regression, alpha=alpha).fit(X, y).get_support().sum()
     for alpha in np.linspace(0, 1, 101)]
)
plt.xlabel("alpha")
plt.ylabel("selected features")
plt.show()
In the source, the alpha is divided by the number of features:
def _get_support_mask(self):
    check_is_fitted(self)
    return self.pvalues_ < self.alpha / len(self.pvalues_)
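With alpha=.15 and 4 features, the effective threshold is therefore 0.15 / 4 = 0.0375, which only the last p-value clears. A quick check on the data from the question (alpha=0.6 below is just an illustrative value, chosen so that the corrected threshold comes out to 0.15):

_, p_vals = f_regression(X, y)
p_vals < 0.15 / len(p_vals)
# array([False, False, False,  True]), same as SelectFwe(f_regression, alpha=.15)

SelectFwe(f_regression, alpha=0.6).fit(X, y).get_support()
# threshold 0.6 / 4 = 0.15 -> array([ True, False, False,  True])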
This is because the class controls the family-wise error rate, which is "the probability of making one or more false discoveries" (Wikipedia, my emphasis). You can instead use SelectFpr, the "false positive rate" test, which works exactly the same way but does not divide alpha by the number of features. See also Issue1007.