
Why am I getting small correlation and mutual information scores?


I'm generating random samples for binary classification problem:

    from sklearn.datasets import make_classification

    X, y = make_classification(n_features=40, n_redundant=4, n_informative=36,
                               n_clusters_per_class=2, n_samples=50000)

I want to check the correlation between the features and the target (for feature selection step).

I'm using 2 different methods:

1. correlation (pearson)
2. mutual information

With both methods, the scores between the features and the target are small.

Mutual information:

    from sklearn.feature_selection import mutual_info_classif
    res1 = mutual_info_classif(X, y)

Correlation:

    import pandas as pd

    df = pd.DataFrame(data=X)
    df['Target'] = y
    res2 = df.apply(lambda x: x.corr(df['Target']))
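(As an aside, pandas can compute the same per-feature Pearson correlations in a single call with `DataFrame.corrwith`; a minimal sketch, using a smaller random dataset for speed:)

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_features=10, n_redundant=4, n_informative=6,
                           n_clusters_per_class=2, n_samples=5000)
df = pd.DataFrame(data=X)

# corrwith correlates every column of df with the target Series at once
res = df.corrwith(pd.Series(y))
print(res)
```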

With both methods, all the results are less than 0.4 in absolute value.

Why am I getting such small scores? I expected higher ones. What am I missing?


Solution

  • This is an artificially generated random classification dataset, produced by scikit-learn's convenience function make_classification. There is absolutely no reason to expect any particular value range for the correlation coefficients between the features and the label. In fact, a simple experiment shows that, as expected with such random data, the correlation values span a wide range, from roughly zero up to about 0.65 (positive or negative); keeping n_features=10 for brevity:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif
    import pandas as pd
    
    for i in range(10):
      X, y = make_classification(n_features=10, n_redundant=4, n_informative=6,
                                 n_clusters_per_class=2, n_samples=50000)
      df = pd.DataFrame(data=X)
      df['Target'] = y
      res2 = df.apply(lambda x: x.corr(df['Target']))
      print(res2)
    

    Result:

    0        -0.299619
    1         0.019879
    2        -0.271226
    3         0.324632
    4        -0.299824
    5         0.277574
    6         0.028462
    7         0.395118
    8         0.297397
    9         0.001334
    Target    1.000000
    dtype: float64
    0        -0.008546
    1        -0.131875
    2         0.009582
    3         0.314725
    4         0.292152
    5         0.002754
    6         0.203895
    7         0.009530
    8        -0.314609
    9         0.310828
    Target    1.000000
    dtype: float64
    0         0.061911
    1         0.648200
    2        -0.293845
    3         0.002402
    4         0.592591
    5        -0.387568
    6         0.277449
    7         0.574272
    8        -0.448803
    9        -0.000266
    Target    1.000000
    dtype: float64
    0         0.289361
    1         0.306837
    2        -0.565776
    3         0.018211
    4        -0.001650
    5        -0.008317
    6        -0.318271
    7         0.025830
    8        -0.001511
    9         0.461342
    Target    1.000000
    dtype: float64
    0         0.316292
    1         0.223331
    2        -0.001817
    3         0.423708
    4        -0.466166
    5        -0.283735
    6        -0.212039
    7         0.311600
    8        -0.292352
    9         0.302497
    Target    1.000000
    dtype: float64
    0         0.006351
    1        -0.004631
    2        -0.331184
    3         0.083991
    4         0.002227
    5        -0.000883
    6        -0.123998
    7         0.374792
    8        -0.087007
    9         0.530111
    Target    1.000000
    dtype: float64
    0        -0.278837
    1         0.360339
    2        -0.407622
    3        -0.026460
    4        -0.275985
    5        -0.007404
    6         0.295955
    7        -0.290008
    8         0.293710
    9         0.138187
    Target    1.000000
    dtype: float64
    0         0.005973
    1        -0.182802
    2        -0.001029
    3        -0.000993
    4         0.207585
    5         0.002144
    6         0.298949
    7        -0.288891
    8        -0.277202
    9        -0.203653
    Target    1.000000
    dtype: float64
    0         0.298933
    1         0.000461
    2        -0.004837
    3         0.290285
    4        -0.013016
    5        -0.003280
    6        -0.131817
    7         0.048733
    8        -0.032910
    9         0.002162
    Target    1.000000
    dtype: float64
    0         0.494809
    1         0.382098
    2         0.549377
    3         0.004632
    4         0.300572
    5        -0.486202
    6        -0.581924
    7         0.300024
    8         0.308240
    9        -0.398422
    Target    1.000000
    dtype: float64
    

    Looking at the correlation values alone, we cannot even tell which features are the informative ones (here 6) and which are the redundant ones (here 4).
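
    If the goal really is to know which columns are informative, the generator itself can be asked: according to the make_classification documentation, with shuffle=False the informative columns come first, followed by the redundant ones. A minimal sketch (the random_state here is arbitrary):

```python
import pandas as pd
from sklearn.datasets import make_classification

# With shuffle=False, make_classification stacks columns in a known order:
# the n_informative features first, then the n_redundant linear combinations,
# then any repeated and noise features.
X, y = make_classification(n_features=10, n_redundant=4, n_informative=6,
                           n_clusters_per_class=2, n_samples=50000,
                           shuffle=False, random_state=0)
corr = pd.DataFrame(data=X).corrwith(pd.Series(y))

# Columns 0-5 are the informative features, 6-9 the redundant ones; note this
# identifies which columns are which, but does not make their correlations
# with the label any larger.
print(corr)
```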

    In short: there is nothing to explain here, and in any case your finding of "less than 0.4" is not accurate.

    Similar arguments hold for the mutual information, too.
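
    The same kind of spot-check can be run for mutual information; the scores again span a range with no particular floor or ceiling. A sketch (the smaller n_samples is just to keep the runtime down, since mutual_info_classif uses a nearest-neighbors estimator):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Repeat the experiment a few times with mutual information instead of
# Pearson correlation; the estimated MI scores are nonnegative and vary
# freely from run to run.
for i in range(3):
    X, y = make_classification(n_features=10, n_redundant=4, n_informative=6,
                               n_clusters_per_class=2, n_samples=5000)
    mi = mutual_info_classif(X, y, random_state=0)
    print(mi)
```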