I am normalizing the original data by subtracting the mean and dividing by the standard deviation while testing different algorithms such as Logistic Regression, Gaussian Naive Bayes, Random Forest, and Multilayer Perceptron. Normalization is not necessary for all of them, but I am trying to be consistent. However, the correlation matrices of the features differ before and after normalization. When selecting features to avoid redundancy in the input data, should I consider both correlation matrices, or only the one computed after normalization, since that is the data fed directly to the machine learning methods?
The correlation matrix should remain the same after a "proper" normalization: standardization is a linear transformation (subtract a constant, divide by a positive constant), and the Pearson correlation is invariant under such transformations, i.e. corr(a*X + b, c*Y + d) = corr(X, Y) for a, c > 0.
Demo:
In [105]: import numpy as np
In [106]: import pandas as pd
In [107]: df = pd.DataFrame(np.random.rand(6, 6)) * 100
Let's save the Pearson correlation matrix before normalization:
In [108]: corr1 = df.corr()
Normalization using sklearn.preprocessing.StandardScaler:
In [109]: from sklearn.preprocessing import StandardScaler
In [110]: scale = StandardScaler()
In [111]: r = scale.fit_transform(df)
Save the Pearson correlation matrix after normalization:
In [112]: corr2 = pd.DataFrame(r).corr()
Compare the saved correlation matrices:
In [114]: np.allclose(corr1, corr2)
Out[114]: True
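The same holds if you normalize manually as described in the question, since subtracting the mean and dividing by the standard deviation is also a linear transformation. As a sanity check, continuing the session above (this snippet is my own sketch, not part of the original demo), the comparison should again report True:

In [115]: manual = (df - df.mean()) / df.std()   # manual standardization per column
In [116]: np.allclose(corr1, manual.corr())      # correlations unchanged by the linear rescaling
Out[116]: True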