machine-learning data-science correlation feature-extraction feature-selection

What does Pearson correlation tell us when features are uncorrelated?

I have a dataset with 31 columns (30 features plus the class) that I am going to use for a classification problem. I wanted to check the correlation between the features using the Pearson correlation available in pandas. When I set the Pearson threshold to > 0.5, I get the following:

import pandas as pd

data = pd.read_csv("../dataset.csv")
cor = data.corr(method='pearson')
cor_target = abs(cor['Class'])
result = cor_target[cor_target > 0.5]
print(result)

The result is:

Class    1.0
Name: Class, dtype: float64

It seems that none of the 30 features are correlated at all. What does this mean? Is this always a good indicator that the features are independent?

Thank you.

Solution

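Your assumption is somewhat wrong. cor['Class'] is a single column of the correlation matrix, so your filter only compares each feature with the class label. It says nothing about how the features correlate with each other.

Take this example: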

import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'Class' : [0, 1, 1, 0, 1]})
cor = data.corr(method='pearson')
print(cor)
cor_target = abs(cor['Class'])
print(cor_target)
result = cor_target[cor_target > 0.5]
print(result)

              a         b     Class
a      1.000000  1.000000  0.288675
b      1.000000  1.000000  0.288675
Class  0.288675  0.288675  1.000000
a        0.288675
b        0.288675
Class    1.000000
Name: Class, dtype: float64
Class    1.0
Name: Class, dtype: float64

Features a and b are exactly the same, so their correlation with each other is 1.0, yet the filter still returns only Class, because it only looks at the Class column.

To see whether the features are correlated with each other, remove the class labels and look only at the correlations between the features.

Inspect the resulting correlation matrix and keep the features that have low correlation with each other.

import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'Class' : [0, 1, 1, 0, 1]})
cor = data[['a', 'b']].corr(method='pearson')
print(cor)
cor_target = abs(cor)

     a    b
a  1.0  1.0
b  1.0  1.0
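
With 30 features the matrix is too large to read by eye, so here is a rough sketch of one way to automate the selection. The extra column c, the 0.5 cut-off, and the rule "drop the later column of each highly correlated pair" are my own assumptions for illustration, not the only sensible choice:

```python
import numpy as np
import pandas as pd

# Toy data: 'b' duplicates 'a'; 'c' is a made-up feature unrelated to both.
features = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1, 2, 3, 4, 5],
    'c': [2, 1, 4, 3, 1],
})

# Absolute feature-to-feature correlations (no class column involved).
cor = features.corr(method='pearson').abs()

# Keep only the strict upper triangle, so each pair appears once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(cor.shape, dtype=bool), k=1)
upper = cor.where(mask)

threshold = 0.5  # assumed cut-off, tune for your data

# Drop every column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Here b is dropped because it duplicates a, while c stays. Which member of a correlated pair you keep is arbitrary in this sketch; in practice you may prefer to keep the one that is more related to the class.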

If you want to use the labels to pick features, try scikit-learn's feature selection tools: https://scikit-learn.org/stable/modules/feature_selection.html
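
As a minimal sketch of label-aware selection, the snippet below uses SelectKBest with the f_classif score, which is just one of the options described on that page. The toy data, the column name noise, and k=1 are assumptions made up for illustration:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 'a' separates the two classes, 'noise' does not.
data = pd.DataFrame({
    'a':     [1, 2, 3, 4, 5, 6],
    'noise': [3, 1, 2, 3, 1, 2],
    'Class': [0, 0, 0, 1, 1, 1],
})

X = data.drop(columns=['Class'])
y = data['Class']

# Score each feature against the labels with an ANOVA F-test
# and keep the single best one (k=1 is an arbitrary choice here).
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = list(X.columns[selector.get_support()])
print(selected)
```

Unlike the Pearson filter in the question, this scores each feature by how well it separates the classes, which is usually closer to what you want for a classification problem.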