Principal Component Analysis

I am studying principal component analysis, and I have just learnt that before applying PCA to the data samples, we have to apply two preprocessing steps: mean normalization and feature scaling. However, I have no idea what mean normalization is or how it can be implemented.

I searched for it first, but I could not find an instructive explanation. Can anyone explain what mean normalization is and how it can be implemented?

Solution

Assume there is a dataset with 'd' features (columns) and 'n' observations (rows). For simplicity's sake, let's consider d=2 and n=100, which means your dataset has 2 features and 100 observations. In other words, your dataset is a 2-dimensional array with 100 rows and 2 columns (100x2). Initially, when you visualize it, you can see that the points are scattered across a 2-dimensional plane.

When you standardize the dataset and visualize it again, you can see that the points are now centered on the origin. In other words, each feature now has a mean of 0 and a standard deviation of 1. This process is called standardization, and it covers both preprocessing steps: subtracting the mean is the mean normalization step, and dividing by the standard deviation is the feature scaling step.

How do you standardize? It's pretty simple; the formula is straightforward.

z = (X - u) / s

Where, 

X - an observation in the feature column
u - mean of the feature column
s - standard deviation of the feature column
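
To make the formula concrete, here is a minimal sketch for a single feature column. The three values are made up for illustration, and the population standard deviation (pstdev) is an assumption on my part; the sample version would give slightly different numbers.

```python
from statistics import fmean, pstdev

# Made-up values for one feature column (illustration only).
column = [2.0, 4.0, 6.0]

u = fmean(column)   # mean of the feature column
s = pstdev(column)  # population standard deviation of the feature column

# Mean normalization on its own: subtract the column mean.
centered = [x - u for x in column]

# Full standardization: z = (X - u) / s
z = [(x - u) / s for x in column]

print(centered)
print(z)
```

After the first step the column is centered on 0; after the second it also has a standard deviation of 1.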

Note: You have to apply standardization to every feature in the dataset, each with its own mean and standard deviation.
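
As a rough sketch of the per-feature step, the snippet below standardizes a 100x2 dataset one column at a time using only the standard library. The `standardize` helper, the random data, and the zero-variance guard are all hypothetical choices for illustration, not part of any library.

```python
import random
from statistics import fmean, pstdev


def standardize(rows):
    """Standardize each column of a list of equal-length rows (hypothetical helper)."""
    columns = list(zip(*rows))                # transpose: one tuple per feature
    means = [fmean(col) for col in columns]   # u for each feature
    stds = [pstdev(col) for col in columns]   # s for each feature (population std)
    return [
        # A constant column has s == 0; map it to 0.0 instead of dividing by zero.
        [(x - u) / s if s > 0 else 0.0 for x, u, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up data: 100 observations, 2 features on very different scales.
random.seed(0)
data = [[random.gauss(50, 10), random.gauss(0.5, 0.1)] for _ in range(100)]

scaled = standardize(data)

for col in zip(*scaled):
    print(round(fmean(col), 6), round(pstdev(col), 6))
```

In practice, the StandardScaler class linked in the references performs the same per-column operation on array data, so a hand-written helper like this is mainly useful for seeing what happens under the hood.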

References:

https://machinelearningmastery.com/normalize-standardize-machine-learning-data-weka/

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html