I have a dataset with 31 columns (30 features plus the class) that I am about to use for a classification problem. I wanted to check the correlation between the features using the Pearson correlation available in pandas. When I keep only the values above a Pearson threshold of 0.5, I get the following:
import pandas as pd
data = pd.read_csv("../dataset.csv")
cor = data.corr(method='pearson')
cor_target = abs(cor['Class'])
result = cor_target[cor_target > 0.5]
print(result)
The result is:
Class 1.0
Name: Class, dtype: float64
It turns out that none of the 30 features is correlated at all. What does this mean? Is this always a good indicator that the features are independent?
Thank you.
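Your assumption is somewhat wrong: the code in the question only measures how each feature correlates with Class, not how the features correlate with each other.
Take this example: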
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'Class' : [0, 1, 1, 0, 1]})
cor = data.corr(method='pearson')
print(cor)
cor_target = abs(cor['Class'])
print(cor_target)
result = cor_target[cor_target > 0.5]
print(result)
              a         b     Class
a      1.000000  1.000000  0.288675
b      1.000000  1.000000  0.288675
Class  0.288675  0.288675  1.000000
a        0.288675
b        0.288675
Class    1.000000
Name: Class, dtype: float64
Class    1.0
Name: Class, dtype: float64
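Features a and b are exactly the same, so their correlation with each other is 1.0, but the filtered result still contains only one entry, Class itself. The filter looks only at each feature's correlation with Class, so it says nothing about whether the features are correlated with one another.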
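On the independence part of the question: Pearson correlation only captures linear association, so a coefficient near zero does not by itself show that two variables are independent. A minimal sketch with made-up data (the columns x and y are hypothetical and not from the dataset in the question), where y is fully determined by x yet the Pearson coefficient is essentially zero:

```python
import pandas as pd

# Hypothetical data: y is completely determined by x, but not linearly.
demo = pd.DataFrame({'x': [-2, -1, 0, 1, 2]})
demo['y'] = demo['x'] ** 2

# Pearson only measures linear association, so it finds no correlation here
# even though y depends entirely on x.
r = demo['x'].corr(demo['y'], method='pearson')
print(abs(r) < 1e-9)
```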
To check whether the features are correlated with each other, remove the class labels and compute the correlation between the features only.
Then inspect the correlation matrix and select the features that have low correlation with one another.
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'Class' : [0, 1, 1, 0, 1]})
cor = data[['a', 'b']].corr(method='pearson')
print(cor)
cor_target = abs(cor)
a b
a 1.0 1.0
b 1.0 1.0
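With 30 features, reading the matrix by eye gets tedious, so the selection step can be automated. The following is only a sketch of one common approach, not the only way to do it: the extra column c and the 0.9 cutoff are assumptions made for illustration. It looks at the upper triangle of the absolute correlation matrix and drops one feature from each highly correlated pair.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [1, 2, 3, 4, 5],
                     'c': [2, 1, 4, 3, 5],      # hypothetical extra feature
                     'Class': [0, 1, 1, 0, 1]})

# Absolute feature-to-feature correlations, with the class label excluded.
cor = data.drop(columns='Class').corr(method='pearson').abs()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = cor.where(np.triu(np.ones(cor.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff; tune it for your data
# Drop any feature that is highly correlated with a feature listed before it.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = data.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Here b is dropped because it duplicates a, while c stays because its correlation with a (0.8) is under the assumed cutoff.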
If you want to use the labels, try scikit-learn's feature selection tools: https://scikit-learn.org/stable/modules/feature_selection.html
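As a rough illustration of what that page offers, here is a minimal sketch using univariate selection with SelectKBest and the f_classif score. The toy data and the choice of k=1 are assumptions made for the example; for a real dataset another scorer or method from that page may fit better.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical toy data: 'a' separates the two classes well, 'b' does not.
data = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                     'b': [5, 1, 4, 2, 6, 3],
                     'Class': [0, 0, 0, 1, 1, 1]})

X = data.drop(columns='Class')
y = data['Class']

# Score each feature against the labels (ANOVA F-test) and keep the best k.
selector = SelectKBest(score_func=f_classif, k=1)  # k=1 is an assumed choice
selector.fit(X, y)

# get_support() returns a boolean mask over the columns of X.
selected = list(X.columns[selector.get_support()])
print(selected)
```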