Tags: python, pandas, scikit-learn, normalize

What type of normalization happens with sklearn?


I have a matrix which I'm trying to normalize by transforming each feature column to zero mean and unit standard deviation.

I'm using the following code, but I want to know whether it actually does what I'm trying to do or whether it applies a different kind of normalization.

from sklearn import preprocessing

mat_normalized = preprocessing.normalize(mat_from_df)

Solution

  • sklearn.preprocessing.normalize rescales each sample (row) vector to unit norm, L2 by default. (The default axis is 1, not 0.) It does not standardize the feature columns. Here's proof that every row ends up with norm 1:

    import numpy as np
    from sklearn.preprocessing import normalize
    
    np.random.seed(444)
    data = np.random.normal(loc=5, scale=2, size=(15, 2))
    np.linalg.norm(normalize(data), axis=1)
    # array([ 1.,  1.,  1.,  1.,  1.,  1., ...
    

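    It sounds like you're looking for sklearn.preprocessing.scale, which standardizes each feature column to zero mean and unit standard deviation. (It only shifts and rescales the columns; it does not make them normally distributed.)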

    from sklearn.preprocessing import scale
    
    # Are the scaled column-wise means approx. 0.?
    np.allclose(scale(data).mean(axis=0), 0.)
    # True
    
    # Are the scaled column-wise stdevs. approx. 1.?
    np.allclose(scale(data).std(axis=0), 1.)
    # True
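
To make the difference concrete, here is a minimal sketch that reproduces both functions by hand with NumPy. It assumes the default arguments (norm="l2", axis=1 for normalize; axis=0 for scale). The names manual_normalized and manual_scaled are only illustrative, and the data is random rather than the asker's matrix.

```python
import numpy as np
from sklearn.preprocessing import normalize, scale

# Illustrative random data (not the matrix from the question).
rng = np.random.default_rng(444)
data = rng.normal(loc=5, scale=2, size=(15, 2))

# normalize with defaults: divide each ROW by its Euclidean (L2) length.
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
manual_normalized = data / row_norms

# scale with defaults: subtract each COLUMN's mean, then divide by the
# column's population standard deviation (ddof=0, NumPy's default).
manual_scaled = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(normalize(data), manual_normalized))  # True
print(np.allclose(scale(data), manual_scaled))          # True
```

One thing to watch for if you standardize directly on a DataFrame: pandas' DataFrame.std defaults to ddof=1, so (df - df.mean()) / df.std() will differ slightly from what scale returns unless you pass ddof=0.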