python · pandas · sklearn-pandas

Different scores when mutual_info_classif used independently and through SelectKBest


I'm trying to select a couple of hundred features out of 60,000, and for this I want to use mutual_info_classif.
However, I get different results when I use mutual_info_classif directly compared with using it through SelectKBest.

To demonstrate this, I define a small DataFrame where only one column is correlated with the target:

       A  B  C  D  E  target
    0  1  1  1  1  1       1
    1  2  3  2  2  2       0
    2  3  3  3  3  3       0
    3  4  3  4  4  4       0
    4  5  1  5  5  5       1
    import pandas as pd
    import numpy as np

    from sklearn.model_selection import train_test_split
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import mutual_info_classif

    df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                       'B': [1, 3, 3, 3, 1],
                       'C': [1, 2, 3, 4, 5],
                       'D': [1, 2, 3, 4, 5],
                       'E': [1, 2, 3, 4, 5],
                       'target': [1, 0, 0, 0, 1]})

    X = df.drop(['target'], axis=1)
    y = df.target
    threshold = 3  # the number of most relevant features

First I get MI scores using mutual_info_classif directly:

    high_score_features1 = []
    feature_scores = mutual_info_classif(X, y, random_state=0, n_neighbors=3,
                                         discrete_features='auto')
    for score, f_name in sorted(zip(feature_scores, X.columns), reverse=True)[:threshold]:
        print(f_name, score)
        high_score_features1.append(f_name)

    feature_scores

Output:

    B 0.48333333333333306
    E 0.0
    D 0.0
    array([0.        , 0.48333333, 0.        , 0.        , 0.        ])

Then I use SelectKBest, and to ensure the same parameters are used, I wrap mutual_info_classif in my own function:

    def my_func(X, y):
        return mutual_info_classif(X, y, random_state=0, n_neighbors=3,
                                   discrete_features='auto')

    high_score_features1 = []
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)

    f_selector = SelectKBest(score_func=my_func, k=threshold)
    f_selector.fit(X_train, y_train)
    for score, f_name in sorted(zip(f_selector.scores_, X.columns), reverse=True)[:threshold]:
        print(f_name, score)
        high_score_features1.append(f_name)

    f_selector.scores_

Output:

    B 0.8333333333333331
    E 0.0
    D 0.0
    array([0.        , 0.83333333, 0.        , 0.        , 0.        ])

I don't understand the source of the difference, and I'm not sure which approach is more reliable for my real data.


Solution

  • The reason you're getting different results is that the two approaches are scoring different datasets. SelectKBest simply calls its score_func on whatever data it is fitted with: your SelectKBest is fitted on the training split (X_train, y_train), whereas your direct call to mutual_info_classif scores the entire dataset (X, y). If you apply both to the same data, they produce identical scores.
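A minimal sketch of that check, reusing the question's DataFrame and parameters and scoring the same training split both ways (variable names such as `direct_scores` and `selector` are mine, not from the original post):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [1, 3, 3, 3, 1],
                   'C': [1, 2, 3, 4, 5],
                   'D': [1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 4, 5],
                   'target': [1, 0, 0, 0, 1]})
X = df.drop(['target'], axis=1)
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

def my_func(X, y):
    return mutual_info_classif(X, y, random_state=0, n_neighbors=3,
                               discrete_features='auto')

# Score the *same* training split directly ...
direct_scores = my_func(X_train, y_train)

# ... and through SelectKBest, which just calls score_func on its fit data.
selector = SelectKBest(score_func=my_func, k=3)
selector.fit(X_train, y_train)

# Fitted on identical data with a fixed random_state, the scores match.
print(np.allclose(direct_scores, selector.scores_))  # True
```

The same holds in the other direction: fitting SelectKBest on the full (X, y) reproduces the direct mutual_info_classif scores on the full data.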