I am studying principle component analysis, and I have just learnt that before applying PCA to the data samples, we have to apply two preprocessing steps which are mean normalization
and feature scaling
. However, I have no idea about what mean normalization is and how it can be implemented.
At first I searched it; however, I could not find a instructive explanation. Is there anyone who can explain what is mean normalization and how it can be implemented ?
Assume there is a dataset with 'd' features(Columns) and 'n' Observations(Rows). For simplicity sake lets consider d=2 and n=100. Which means now you dataset has 2 features and 100 observations. In other words, now your dataset is a 2-dimensional array with 100 rows and 2 columns - (100x2). Initially, when you visualize it, you can see that the points are scattered in a 2 dimension.
When you standardize the dataset, and when you visualize it you can actually see that all the points have shifted towards the origin. In other words, all the observation points have a mean of value 0 and standard deviation of value 1. This process is called Standardization.
How do you Standardize..? Its pretty simple. The Formula is straight forward.
z = (X - u) / s
Where,
X - an observation in the feature column
u - mean of the feature column
s - standard deviation of the feature column
Note: You have to apply standardization with respect to all feature in the dataset
Reference:
https://machinelearningmastery.com/normalize-standardize-machine-learning-data-weka/
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html