I'm generating random samples for a binary classification problem:
from sklearn.datasets import make_classification
X, y = make_classification(n_features=40, n_redundant=4, n_informative=36, n_clusters_per_class=2, n_samples=50000)
I want to check the correlation between the features and the target (for feature selection).
I'm using 2 different methods:
1. correlation (pearson)
2. mutual information
I'm getting small scores between the features and the target for both methods.
Mutual information:
from sklearn.feature_selection import mutual_info_classif
res1 = mutual_info_classif(X, y)
Correlation:
import pandas as pd
df = pd.DataFrame(data=X)
df['Target'] = y
res2 = df.apply(lambda x: x.corr(df['Target']))
For both methods I'm getting results that are less than 0.4.
Why am I getting such small scores? I expected higher ones. What am I missing?
This is an artificially generated random classification dataset, made by the convenience function make_classification of scikit-learn. There is no reason to expect any particular value range for the correlation coefficients between the features and the label. In fact, a simple experiment shows a whole range of correlation values, going as high as about 0.65 in absolute value and as low as about zero, as expected in such random data (keeping n_features=10 for brevity):
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
import pandas as pd
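# Generate 10 independent datasets and print the feature/target correlations of each
for i in range(10):
    X, y = make_classification(n_features=10, n_redundant=4, n_informative=6, n_clusters_per_class=2, n_samples=50000)
    df = pd.DataFrame(data=X)
    df['Target'] = y
    # Pearson correlation of every column (including Target itself) with the target
    res2 = df.apply(lambda x: x.corr(df['Target']))
    print(res2)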
0 -0.299619
1 0.019879
2 -0.271226
3 0.324632
4 -0.299824
5 0.277574
6 0.028462
7 0.395118
8 0.297397
9 0.001334
Target 1.000000
dtype: float64
0 -0.008546
1 -0.131875
2 0.009582
3 0.314725
4 0.292152
5 0.002754
6 0.203895
7 0.009530
8 -0.314609
9 0.310828
Target 1.000000
dtype: float64
0 0.061911
1 0.648200
2 -0.293845
3 0.002402
4 0.592591
5 -0.387568
6 0.277449
7 0.574272
8 -0.448803
9 -0.000266
Target 1.000000
dtype: float64
0 0.289361
1 0.306837
2 -0.565776
3 0.018211
4 -0.001650
5 -0.008317
6 -0.318271
7 0.025830
8 -0.001511
9 0.461342
Target 1.000000
dtype: float64
0 0.316292
1 0.223331
2 -0.001817
3 0.423708
4 -0.466166
5 -0.283735
6 -0.212039
7 0.311600
8 -0.292352
9 0.302497
Target 1.000000
dtype: float64
0 0.006351
1 -0.004631
2 -0.331184
3 0.083991
4 0.002227
5 -0.000883
6 -0.123998
7 0.374792
8 -0.087007
9 0.530111
Target 1.000000
dtype: float64
0 -0.278837
1 0.360339
2 -0.407622
3 -0.026460
4 -0.275985
5 -0.007404
6 0.295955
7 -0.290008
8 0.293710
9 0.138187
Target 1.000000
dtype: float64
0 0.005973
1 -0.182802
2 -0.001029
3 -0.000993
4 0.207585
5 0.002144
6 0.298949
7 -0.288891
8 -0.277202
9 -0.203653
Target 1.000000
dtype: float64
0 0.298933
1 0.000461
2 -0.004837
3 0.290285
4 -0.013016
5 -0.003280
6 -0.131817
7 0.048733
8 -0.032910
9 0.002162
Target 1.000000
dtype: float64
0 0.494809
1 0.382098
2 0.549377
3 0.004632
4 0.300572
5 -0.486202
6 -0.581924
7 0.300024
8 0.308240
9 -0.398422
Target 1.000000
dtype: float64
Looking at the correlation values alone, we cannot even tell which features are the informative ones (here 6) and which are the redundant ones (here 4).
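One way to check this is to generate the data with shuffle=False, in which case make_classification keeps the documented column order (informative features first, then the redundant ones), and to put each feature's type next to its absolute correlation. The following is only a minimal sketch: the sample size, the fixed random_state and the names labels and summary are my own choices for illustration, not part of the experiment above.

```python
import pandas as pd
from sklearn.datasets import make_classification

n_informative, n_redundant = 6, 4

# shuffle=False keeps the documented column order:
# informative features first, followed by the redundant ones.
X, y = make_classification(
    n_features=n_informative + n_redundant,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_clusters_per_class=2,
    n_samples=5000,      # smaller than above, only to keep the sketch fast
    shuffle=False,
    random_state=0,      # arbitrary seed, only for reproducibility
)

# Absolute Pearson correlation of each feature with the target
abs_corr = pd.DataFrame(X).corrwith(pd.Series(y)).abs()

# Known feature type for each column, thanks to shuffle=False
labels = ["informative"] * n_informative + ["redundant"] * n_redundant
summary = pd.DataFrame({"kind": labels, "abs_corr": abs_corr.values})
print(summary)
```

Re-running this with different seeds lets you see how much the values for the two groups overlap.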
In short: there is nothing to be explained here; moreover, your finding of "less than 0.4" is not accurate.
Similar arguments hold for the mutual information, too.
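For completeness, here is a sketch of the same kind of experiment with mutual_info_classif. The seeds, the reduced sample size and the variable names are assumptions made for illustration. Keep in mind that the mutual information between any feature and a binary target cannot exceed the entropy of the target, which is at most ln 2 (about 0.693 nats), so small values are to be expected here as well; the estimator is nearest-neighbour based, so individual estimates are noisy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

results = []
for seed in range(3):  # a few arbitrary seeds, one dataset per seed
    X, y = make_classification(
        n_features=10,
        n_redundant=4,
        n_informative=6,
        n_clusters_per_class=2,
        n_samples=2000,   # kept small so the estimator runs quickly
        random_state=seed,
    )
    # Estimated mutual information (in nats) of each feature with the target
    mi = mutual_info_classif(X, y, random_state=seed)
    results.append(mi)
    print(f"seed={seed}", np.round(mi, 3))

# Theoretical upper bound for a binary target, for comparison
print("ln 2 =", np.log(2))
```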