python, machine-learning, scikit-learn, sparse-matrix

How exactly does Standard Scaling a Sparse Matrix work?


I am currently reading "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow" and came across a tip stating: "If you want to scale a sparse matrix without converting it to a dense matrix first, you can use a StandardScaler with its with_mean hyperparameter set to False: it will only divide the data by the standard deviation, without subtracting the mean (as this would break sparsity)."

I tried it out to understand what it does, but the result does not seem to be scaled at all. I created a csr_matrix from a np.array, constructed a StandardScaler with with_mean=False, and called fit_transform on the matrix. All non-zero results are identical, so nothing appears to be scaled.

I don't understand how the results are calculated. I assumed the mean is treated as zero and every non-zero value is divided by the standard deviation of its column, but that calculation would give a scaled value of 1.732, which does not match the output.

from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import numpy as np

X = csr_matrix(np.array([[0, 0, 1], [0, 2, 0], [3, 0, 0]]))
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)
print(X_scaled)
print(X_scaled.toarray())


This outputs:

  (0, 2)    2.1213203435596424
  (1, 1)    2.1213203435596424
  (2, 0)    2.1213203435596424

[[0.         0.         2.12132034]
 [0.         2.12132034 0.        ]
 [2.12132034 0.         0.        ]]

Am I doing something wrong, or am I misunderstanding something? The output is not what I expected.


Solution

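  • With with_mean=False, StandardScaler only divides each column by its standard deviation.
    That standard deviation is the ordinary population standard deviation (what np.std returns), taken over every entry in the column, zeros included, and measured around the column's actual mean. The mean is not subtracted from the data, but it is not assumed to be zero when the standard deviation is computed.
    Each column of your matrix holds one non-zero value i and two zeros, so its standard deviation is i * sqrt(2) / 3, and the scaled value is 3 / sqrt(2) = 2.12132... for any i. The 1.732 you expected is sqrt(3), which is what you get if you measure the deviation around a mean of zero instead.
    The loop below confirms the result for i = 1, 2, 3: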

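    import numpy as np

    # Each column is [0, 0, i]; np.std uses the population formula (ddof=0).
    for i in range(1, 4):
        s = np.std([0, 0, i])
        print(f"{i} / {s:0.5f} = {i / s:0.5f}")

    # Prints:
    # 1 / 0.47140 = 2.12132
    # 2 / 0.94281 = 2.12132
    # 3 / 1.41421 = 2.12132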
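
To check the same thing against the scaler itself, here is a minimal sketch. It reads the per-column divisor from the fitted scaler's scale_ attribute and compares it with np.std on the dense columns. The names X_dense, col_std and col_rms, and the second matrix Y_dense, are introduced here purely for illustration and are not part of the question.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Same data as in the question, kept dense as well for comparison.
X_dense = np.array([[0, 0, 1], [0, 2, 0], [3, 0, 0]], dtype=float)
X = csr_matrix(X_dense)

scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

# Population standard deviation (ddof=0) of each column, zeros included.
col_std = X_dense.std(axis=0)

# scale_ is the per-column divisor learned by the scaler.
print("scale_ :", scaler.scale_)
print("np.std :", col_std)

# Dividing by the root mean square instead (deviation around a mean of 0)
# reproduces the sqrt(3) value expected in the question.
col_rms = np.sqrt((X_dense ** 2).mean(axis=0))
print("value / rms:", X_dense.max(axis=0) / col_rms)

# Illustrative second matrix with differently structured columns:
# here the scaled non-zero values are no longer all identical.
Y_dense = np.array([[1, 0, 4], [2, 5, 0], [3, 0, 6]], dtype=float)
Y_scaled = StandardScaler(with_mean=False).fit_transform(csr_matrix(Y_dense))
print(Y_scaled.toarray())
```

The identical outputs in the question come from the shape of the data: every column has exactly one non-zero entry among three rows, so the ratio of value to standard deviation is the same in each column. With columns that differ in structure, as in Y_dense, the scaled values differ as well.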