I'm trying to select a couple hundred features out of 60,000, and for this I want to use mutual_info_classif.
But I get different results when I call mutual_info_classif directly compared with using it through SelectKBest.
To demonstrate, I define a small DataFrame in which only one column is correlated with the target:
   A  B  C  D  E  target
0  1  1  1  1  1       1
1  2  3  2  2  2       0
2  3  3  3  3  3       0
3  4  3  4  4  4       0
4  5  1  5  5  5       1
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [1, 3, 3, 3, 1],
                   'C': [1, 2, 3, 4, 5],
                   'D': [1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 4, 5],
                   'target': [1, 0, 0, 0, 1]})
X = df.drop(['target'],axis=1)
y = df.target
threshold = 3 # the number of most relevant features
Then I get the MI scores by calling mutual_info_classif directly:
high_score_features1 = []
feature_scores = mutual_info_classif(X, y, random_state=0, n_neighbors=3, discrete_features='auto')
for score, f_name in sorted(zip(feature_scores, X.columns), reverse=True)[:threshold]:
    print(f_name, score)
    high_score_features1.append(f_name)
feature_scores
Output:
B 0.48333333333333306
E 0.0
D 0.0
array([0. , 0.48333333, 0. , 0. , 0. ])
Then I use SelectKBest, and to ensure the same parameters are used, I pass my own score function:
def my_func(X, y):
    return mutual_info_classif(X, y, random_state=0, n_neighbors=3, discrete_features='auto')
high_score_features1 = []
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
f_selector = SelectKBest(score_func=my_func, k=threshold)
f_selector.fit(X_train, y_train)
for score, f_name in sorted(zip(f_selector.scores_, X.columns), reverse=True)[:threshold]:
    print(f_name, score)
    high_score_features1.append(f_name)
f_selector.scores_
Output:
B 0.8333333333333331
E 0.0
D 0.0
array([0. , 0.83333333, 0. , 0. , 0. ])
I don't understand the source of the difference, and I'm not sure which approach is more reliable for my real data.
It seems that the reason you're getting different results between calling mutual_info_classif directly and going through SelectKBest is that you're scoring different datasets. Your SelectKBest selector is fitted on a training split, whereas your direct mutual_info_classif call scores the entire data. If you run both on the same data, they give identical output.
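For example, reusing the df, X, y, my_func, and threshold defined in your question, both routes agree once they score the same rows (a quick check; the commented arrays are the two outputs you posted):

# Fit SelectKBest on the full data, just like the direct call:
f_selector = SelectKBest(score_func=my_func, k=threshold)
f_selector.fit(X, y)
print(f_selector.scores_)
# array([0. , 0.48333333, 0. , 0. , 0. ])  <- matches mutual_info_classif(X, y, ...)

# Conversely, score only the training split, just like the SelectKBest call in the question:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(my_func(X_train, y_train))
# array([0. , 0.83333333, 0. , 0. , 0. ])  <- matches f_selector.scores_ from the question

As an aside, instead of defining my_func you could pass functools.partial(mutual_info_classif, random_state=0, n_neighbors=3, discrete_features='auto') as score_func; either way, the scores depend only on the data the selector is fitted on.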