I am normalizing the original data by subtracting the mean and dividing by the standard deviation while testing different algorithms such as Logistic Regression, Gaussian Naive Bayes, Random Forest, and Multilayer Perceptron. Normalization is not necessary for all of them, but I am trying to be consistent. However, the correlation matrices of the features differ before and after normalization. When selecting features to avoid redundancy in the input data, should I consider both correlation matrices, or only the one computed after normalization, since that is the data fed directly to the machine learning methods?
The correlation matrix should remain the same after a "proper" normalization: standardization is a linear transformation (subtract a constant, divide by a positive constant), and the Pearson correlation is invariant under such transformations, i.e. corr(a*X + b, c*Y + d) = corr(X, Y) for a, c > 0.
Demo:
In [105]: import numpy as np
In [106]: import pandas as pd
In [107]: df = pd.DataFrame(np.random.rand(6, 6)) * 100
Let's save the Pearson correlation matrix before normalization:
In [108]: corr1 = df.corr()
Normalization using sklearn.preprocessing.StandardScaler:
In [109]: from sklearn.preprocessing import StandardScaler
In [110]: scale = StandardScaler()
In [111]: r = scale.fit_transform(df)
Save the Pearson correlation matrix after normalization:
In [112]: corr2 = pd.DataFrame(r).corr()
Compare the saved correlation matrices:
In [114]: np.allclose(corr1, corr2)
Out[114]: True
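The same holds if you normalize manually as described in the question, since subtracting the mean and dividing by the standard deviation is also a linear transformation. As a sanity check, continuing the session above (this snippet is my own sketch, not part of the original demo), the comparison should again report True:

In [115]: manual = (df - df.mean()) / df.std()   # manual standardization per column
In [116]: np.allclose(corr1, manual.corr())      # correlations unchanged by the linear rescaling
Out[116]: True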