machine-learning, data-science, correlation, feature-extraction, feature-selection

What does Pearson correlation tell us when features are uncorrelated?


I have a dataset with 31 columns (30 features plus the class) that I am going to use for a classification problem. I wanted to check the correlation between the features using the Pearson correlation available in pandas. When I keep only correlations whose absolute value is above 0.5, I get the following:

import pandas as pd

data = pd.read_csv("../dataset.csv")
cor = data.corr(method='pearson')
cor_target = abs(cor['Class'])
result = cor_target[cor_target > 0.5]
print(result)

The result is:

Class    1.0
Name: Class, dtype: float64

It seems that none of the 30 features is correlated at all. What does this mean? Is a low Pearson correlation always a good indicator that features are independent?

Thank you.


Solution

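  • Your assumption is not quite right: the code only checks how strongly each feature correlates with Class, not how the features correlate with each other.

    Take this example: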

    import pandas as pd
    
    data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'Class' : [0, 1, 1, 0, 1]})
    cor = data.corr(method='pearson')
    print(cor)
    cor_target = abs(cor['Class'])
    print(cor_target)
    result = cor_target[cor_target > 0.5]
    print(result)
    
                  a         b     Class
    a      1.000000  1.000000  0.288675
    b      1.000000  1.000000  0.288675
    Class  0.288675  0.288675  1.000000
    a        0.288675
    b        0.288675
    Class    1.000000
    Name: Class, dtype: float64
    Class    1.0
    Name: Class, dtype: float64
    

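    Features a and b are identical, so their correlation with each other is 1.0, yet the filter still returns only Class. That is because it looks at the Class column of the correlation matrix alone, and neither feature correlates with Class above 0.5.

    To compare the features with each other, remove the class labels and compute the correlation between the features only.

    Then inspect the correlation matrix and keep the features that have low correlation with one another.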

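    import pandas as pd
    
    data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'Class' : [0, 1, 1, 0, 1]})
    # Leave out the label column and correlate the features with each other.
    cor = data[['a', 'b']].corr(method='pearson')
    print(cor)
    # Absolute values, so that strong negative correlations count as well.
    cor_abs = cor.abs()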
    
    
         a    b
    a  1.0  1.0
    b  1.0  1.0
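
With 30 features the matrix is too large to read by eye, so the same check can be automated. The following is only a rough sketch, not a definitive recipe: the extra column c and its values are made up for illustration, and the 0.5 cut-off is simply the one from the question. It masks everything except the upper triangle of the feature-only correlation matrix so that each pair is reported once, then lists the pairs above the threshold.

```python
import numpy as np
import pandas as pd

# Toy data: 'b' duplicates 'a'; 'c' is a hypothetical extra feature.
data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1, 2, 3, 4, 5],
    'c': [2, 9, 4, 7, 1],
    'Class': [0, 1, 1, 0, 1],
})

# Correlate the features with each other, without the label column.
features = data.drop(columns='Class')
cor = features.corr(method='pearson').abs()

# Keep only the upper triangle (k=1 skips the diagonal) so every
# pair of features appears exactly once; the rest becomes NaN.
mask = np.triu(np.ones(cor.shape, dtype=bool), k=1)
upper = cor.where(mask)

threshold = 0.5  # same cut-off as in the question

# Long format: one row per (feature, feature) pair.
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)

# Columns that are strongly correlated with an earlier column.
# These are candidates for removal, not an automatic decision.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

On this toy data only the pair (a, b) exceeds the threshold, so b is flagged as redundant. Which column of a correlated pair to drop is a judgment call; this sketch simply flags the later one.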
    

    If you want to take the labels into account when choosing features, try scikit-learn's feature selection tools: https://scikit-learn.org/stable/modules/feature_selection.html
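
On the second part of the question: a low Pearson coefficient is not, in general, a good indicator of independence, because Pearson only measures linear association. A small illustration with made-up values (the column names x and y are chosen just for this sketch): y is computed directly from x, so the two are fully dependent, yet their Pearson correlation is zero because the relationship is symmetric rather than linear.

```python
import pandas as pd

# x is symmetric around zero; y is completely determined by x.
df = pd.DataFrame({'x': [-2, -1, 0, 1, 2]})
df['y'] = df['x'] ** 2

# Pearson correlation between x and y.
r = df['x'].corr(df['y'], method='pearson')
print(r)
```

So a feature with near-zero Pearson correlation to Class may still carry useful nonlinear information. Measures that do not assume linearity, such as the mutual information estimators in scikit-learn's feature selection module, may be worth trying before discarding it.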