Tags: scikit-learn, scipy, cluster-analysis, sklearn-pandas, data-preprocessing

Pre-processing data: difference between sklearn StandardScaler and scipy whiten


I'm trying to apply cluster analysis to some data and, following some tutorials, I've come across two options for normalizing the data: the StandardScaler function and the whiten function.

What is the difference between the two?


Solution

  • StandardScaler() performs z-score normalization: it removes the per-column mean and divides by the per-column standard deviation. whiten() does not remove the mean; it only divides each value by the standard deviation of its column. I'd personally go for StandardScaler(), because it lets you call fit_transform on the training set and then transform on the test set with the same fitted statistics, which prevents data leakage (see the sketch right below).
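
    A minimal sketch of that train/test workflow, where the split into X_train and X_test is purely illustrative:

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    X_train = np.array([[1.9, 2.3, 1.7],
                        [1.5, 2.5, 2.2]])
    X_test = np.array([[0.8, 0.6, 1.7]])

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
    X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test data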

    Below is a proof of principle:

    Load the packages

    import numpy as np
    from scipy.cluster.vq import whiten
    from sklearn.preprocessing import StandardScaler
    

    Initialize a features array

    features = np.array([[1.9, 2.3, 1.7],
                         [1.5, 2.5, 2.2],
                         [0.8, 0.6, 1.7]])
    

    As an example, calculate the mean and the standard deviation of the first column

    mean_col1 = np.mean(features[:, 0])
    std_col1 = np.std(features[:, 0])
    

    Execute the whiten function

    whit = whiten(features)
    

    Initialize the StandardScaler and call fit_transform on the features

    scaler = StandardScaler()
    std_scaler = scaler.fit_transform(features)
    

    Manually calculate the z-score of the first value in the first column

    z_score = (features[0, 0] - mean_col1) / std_col1
    

    Manually perform the whiten calculation for the first value in the first column

    w_score = features[0, 0] / std_col1
    

    As you can see, the manual calculations match the corresponding function outputs.
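
    A quick check, reusing the variables defined above, could confirm this:

    # Both comparisons print True: the manual z-score equals the StandardScaler
    # output for that entry, and the manual whiten value equals the whiten() output.
    print(np.isclose(z_score, std_scaler[0, 0]))  # z-score of the first value, approx. 1.10
    print(np.isclose(w_score, whit[0, 0]))        # whitened first value, approx. 4.18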