I'm trying to apply clustering analysis to some data while following some tutorials, and there are two options for normalizing the data: the StandardScaler function and the whiten function.
What is the difference between the two?
StandardScaler() performs Z-score normalization: it removes the mean of each feature and divides by its standard deviation. whiten() does not remove the mean; it only divides each feature by its standard deviation. I'd personally go for StandardScaler(), because it lets you fit_transform on the train set and then transform the test set, which prevents data leakage.
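Here is a minimal sketch of that train/test workflow, assuming a toy data array and an arbitrary split purely for illustration:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy data, purely for illustration
X = np.array([[1.9, 2.3, 1.7],
              [1.5, 2.5, 2.2],
              [0.8, 0.6, 1.7],
              [1.2, 1.8, 2.0]])

# split first, then scale
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the train set only
X_test_scaled = scaler.transform(X_test)        # reuse the train-set mean/std on the test set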
Below is a proof of principle:
Load the packages
import numpy as np
from scipy.cluster.vq import whiten
from sklearn.preprocessing import StandardScaler
Initialize a features array
features = np.array([[1.9, 2.3, 1.7],
                     [1.5, 2.5, 2.2],
                     [0.8, 0.6, 1.7]])
As an example, calculate the mean and standard deviation of the first column
meancol1 = np.mean(features[:,0])
std_col1 = np.std(features[:,0])
Execute the whiten function
whit = whiten(features)
Initialize the standard scaler and fit_transform on the features
scaler = StandardScaler()
std_scaler = scaler.fit_transform(features)
Manually calculate the z-score of the first instance
z_score = (features[0,0] - meancol1) / std_col1
Manually perform the whiten function for the first instance
w_score = features[0,0]/std_col1
As you can see, the manual calculations match the respective function outputs: z_score equals std_scaler[0, 0] and w_score equals whit[0, 0].
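You can verify this numerically, continuing from the code above:
print(z_score, std_scaler[0, 0])              # both give the z-score of the first value
print(w_score, whit[0, 0])                    # both give the whitened first value
print(np.isclose(z_score, std_scaler[0, 0]))  # True
print(np.isclose(w_score, whit[0, 0]))        # True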