How to scale the test set based on the mean and std from the train set in Python?


I read an answer to the question "Why feature scaling only to training set?", and it says: "Standardize any test set using the training set means and standard deviations."

So I am trying to fix my earlier, incorrect approach. However, I checked the official documentation of StandardScaler(), and it does not support scaling with a given mean and std, like this:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler(mean = train_x.mean(), var_x = train.std())
sc.fit(test_x)

# this code is incorrect, but what is the correct code?

So my question is: how do I scale the test set based on the mean and std from the train set in Python?


Solution

  • According to the official documentation:

    with_mean : bool, default=True
        If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

    with_std : bool, default=True
        If True, scale the data to unit variance (or equivalently, unit standard deviation).

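    Both parameters are on/off switches, not places to pass in values. The mean and standard deviation are learned from whatever data is given to fit(). To scale the test set with the training set's statistics, fit the scaler on the training set and then transform the test set with it:

    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    sc.fit(train_x)                       # learn mean and std from the training set only
    test_x_scaled = sc.transform(test_x)  # apply the training statistics to the test set
    

    with_mean and with_std are booleans, so each is either True or False; they cannot be used to supply a precomputed mean or standard deviation.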
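Here is a minimal sketch of scaling a test set with the training set's statistics, checked against the manual formula (test_x - train_mean) / train_std. The small arrays are made-up stand-ins for the question's train_x and test_x, not real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the question's train_x and test_x.
train_x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_x = np.array([[2.0, 40.0], [4.0, 0.0]])

sc = StandardScaler()
sc.fit(train_x)                       # learns mean_ and scale_ from the training set only
test_x_scaled = sc.transform(test_x)  # reuses those training statistics on the test set

# Manual equivalent. StandardScaler uses the population standard deviation
# (ddof=0), which is also the default of NumPy's std().
train_mean = train_x.mean(axis=0)
train_std = train_x.std(axis=0)
manual = (test_x - train_mean) / train_std

# Both routes should give the same scaled test set.
print(np.allclose(test_x_scaled, manual))
```

After fitting, the learned statistics are available as sc.mean_ and sc.scale_. If train_x is a pandas DataFrame, note that its .std() defaults to ddof=1, so a manual computation only matches StandardScaler when you pass ddof=0.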