python machine-learning regression data-science scaling

How to scale a data set to the same scale as another?

I’m currently scaling my training data for my regression model and the data I eventually put into the model for the prediction separately using StandardScaler.

Would this scale the prediction data down to the same level as the training data’s scaling or is it different? And will it therefore lead to incorrect predictions?

If so, how can I scale the second data set with the same mean etc as the training set? Would I have to manually apply the formula to the second data set using the mean and variance of the former?

Thanks

Solution

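When you scale your data, you should fit the scaler only on the training data. If the scaler is also fit on the prediction/test data, the range of that data affects how the training data is scaled and thus what your model learns. This is a form of data leakage. Fitting a second scaler on the prediction data separately is a problem too: that data would be scaled with its own mean and standard deviation rather than the training set's, so the model would see inputs on a different scale from the one it was trained on, which can lead to incorrect predictions.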

In Python it would look something like:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler() # Create a scaler
    scaler.fit(training_data) # Fit only to the training data (learns its mean and standard deviation)
    scaled_training_data = scaler.transform(training_data) # What your model learns on
    scaled_test_data = scaler.transform(test_data) # Scale your test data using the same scaling as the training data

(Note: you can fit and transform your training data in one step using fit_transform().)
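
As for applying the formula manually: you could, but there is no need to, because the fitted scaler already stores the training statistics. Here is a minimal sketch, assuming scikit-learn and NumPy are installed. The small arrays are made-up illustrative data, not from the question. It checks that transform() on the second data set gives the same result as subtracting the training mean (scaler.mean_) and dividing by the training standard deviation (scaler.scale_).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up illustrative data (two features); substitute your own arrays
training_data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test_data = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler = StandardScaler()
scaled_training_data = scaler.fit_transform(training_data)  # Fit and transform in one step

# Scale the second data set with the statistics learned from the training data
scaled_test_data = scaler.transform(test_data)

# The same thing done by hand with the stored training mean and standard deviation
manual_test_data = (test_data - scaler.mean_) / scaler.scale_

print(np.allclose(scaled_test_data, manual_test_data))
```

Note that the scaled test data will generally not have mean 0 and variance 1 itself. That is expected, since it is expressed on the training data's scale.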