Tags: normalization, pca, standardization

Why do we need to standardize data before PCA?


I tried to understand what we should do before PCA: standardization (x − m)/s or normalization (scaling into the [0, 1] interval). The sklearn tutorial uses standardization and shows that PCA with standardization performs better:

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
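To make the two options concrete, here is a minimal numpy sketch on hypothetical toy data; the two transforms correspond to what sklearn's `StandardScaler` and `MinMaxScaler` compute:

```python
import numpy as np

# Toy data (hypothetical): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Standardization: (x - m) / s -> each feature gets mean 0, variance 1
# (this is what sklearn's StandardScaler does)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: rescale each feature into [0, 1]
# (this is what sklearn's MinMaxScaler does)
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

print(X_std.mean(axis=0), X_std.std(axis=0))   # ~[0, 0] and ~[1, 1]
print(X_norm.min(axis=0), X_norm.max(axis=0))  # [0, 0] and [1, 1]
```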

And I found the same in other answers; for example, this one states that PCA will pay more attention to features with higher variance, so you should make all variances the same:

https://datascience.stackexchange.com/questions/86448/principal-components-analysis-need-standardization-or-normalization

But this is the whole approach of PCA: maximize the variance of the projected data. How will PCA maximize the variance if standardized data have the same variance in all directions?
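To make the puzzle concrete, here is a quick check with hypothetical correlated toy data: after standardization every *feature* has variance 1, yet the variance along an arbitrary *direction* u, which is uᵀCu for covariance C, still differs from direction to direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated features (hypothetical toy data)
x = rng.normal(size=1000)
y = x + 0.1 * rng.normal(size=1000)
X = np.column_stack([x, y])

# Standardize: each feature now has variance 1 ...
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.var(axis=0))  # ~[1, 1]

# ... but the variance along a direction u is u^T C u, which still varies
C = np.cov(Z, rowvar=False)
u_along = np.array([1.0, 1.0]) / np.sqrt(2)    # along the correlation
u_across = np.array([1.0, -1.0]) / np.sqrt(2)  # against it
print(u_along @ C @ u_along)    # close to 2 (large)
print(u_across @ C @ u_across)  # close to 0 (small)
```

So standardization equalizes the per-feature variances, not the variances in every direction, and PCA still has something to maximize.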


Solution

  • Because PCA changes the basis of the variables. If you are measuring highly correlated data, you can change the basis so that most of the variance is concentrated in a few variables.

    Imagine the extreme case where you have two variables x, y that measure exactly (or almost) the same thing, so their correlation is 1. Then you are better off using the variables t1 = (x + y)/2 and t2 = (x − y)/2: t1 will contain all the variance of the original data and t2 will have no variance, so it can later be removed.
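The extreme case above can be checked with a short sketch on synthetic data (hypothetical values; note that sklearn's PCA returns unit-norm directions, so it finds (x + y)/√2 rather than (x + y)/2, which is the same direction up to scaling):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = x.copy()  # perfectly correlated: corr(x, y) = 1
X = np.column_stack([x, y])

pca = PCA(n_components=2).fit(X)

# First component holds essentially all the variance, second none
print(pca.explained_variance_ratio_)  # ~[1.0, 0.0]

# First principal direction is ~ ±(1, 1)/sqrt(2), i.e. t1 ∝ x + y
print(pca.components_[0])
```

The second component can then be dropped with no loss of information, which is exactly the point of the change of basis.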