Tags: scikit-learn, scipy, cluster-analysis, sklearn-pandas, data-preprocessing

Pre-processing data: difference between sklearn StandardScaler and scipy whiten


I'm trying to apply cluster analysis to some data and, following some tutorials, I've come across two options for normalizing the data: the StandardScaler function and the whiten function.

What is the difference between the two?


Solution

  • StandardScaler() performs z-score normalization: it removes the per-column mean and divides by the per-column standard deviation. whiten() does not remove the mean; it only divides each value by the standard deviation of its column. I'd personally go for StandardScaler(), because it lets you call fit_transform on the training set and then transform on the test set with the same fitted statistics, which prevents data leakage (see the sketch right below).
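
    A minimal sketch of that train/test workflow, where the split into X_train and X_test is purely illustrative:

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    X_train = np.array([[1.9, 2.3, 1.7],
                        [1.5, 2.5, 2.2]])
    X_test = np.array([[0.8, 0.6, 1.7]])

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
    X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test data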

    Below is a proof of principle:

    Load the packages

    import numpy as np
    from scipy.cluster.vq import whiten
    from sklearn.preprocessing import StandardScaler
    

    Initialize a features array

    features = np.array([[1.9, 2.3, 1.7],
                         [1.5, 2.5, 2.2],
                         [0.8, 0.6, 1.7]])
    

    As an example, calculate the mean and the standard deviation of the first column

    mean_col1 = np.mean(features[:, 0])
    std_col1 = np.std(features[:, 0])
    

    Execute the whiten function

    whit = whiten(features)
    

    Initialize the StandardScaler and call fit_transform on the features

    scaler = StandardScaler()
    std_scaler = scaler.fit_transform(features)
    

    Manually calculate the z-score of the first value in the first column

    z_score = (features[0, 0] - mean_col1) / std_col1
    

    Manually perform the whiten calculation for the first value in the first column

    w_score = features[0, 0] / std_col1
    

    As you can see, the manual calculations match the corresponding function outputs.
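
    A quick check, reusing the variables defined above, could confirm this:

    # Both comparisons print True: the manual z-score equals the StandardScaler
    # output for that entry, and the manual whiten value equals the whiten() output.
    print(np.isclose(z_score, std_scaler[0, 0]))  # z-score of the first value, approx. 1.10
    print(np.isclose(w_score, whit[0, 0]))        # whitened first value, approx. 4.18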