python, machine-learning, regression, data-science, scaling

How to scale a data set to the same scale as another?


I’m currently scaling the training data for my regression model and, separately, the data I eventually feed into the model for prediction, using StandardScaler for both.

Would this scale the prediction data down to the same level as the training data’s scaling or is it different? And will it therefore lead to incorrect predictions?

If so, how can I scale the second data set with the same mean, etc., as the training set? Would I have to manually apply the formula to the second data set using the mean and variance of the former?

Thanks


Solution

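  • When you scale your data, fit the scaler only on the training data, then apply that same fitted scaler to the prediction/test data. If you fit a separate scaler on the prediction data, it uses that data's own mean and standard deviation, so the same raw value ends up with a different scaled value than the model saw during training, and the predictions will be off. Likewise, if you fit on the training and test data together, the test data's statistics affect how the training data is scaled and thus what your model learns. This is a form of data leakage.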

    In Python it would look something like:

        from sklearn.preprocessing import StandardScaler

        scaler = StandardScaler()  # Create a scaler
        scaler.fit(training_data)  # Fit only to the training data
        scaled_training_data = scaler.transform(training_data)  # What your model learns on
        scaled_test_data = scaler.transform(test_data)  # Scale the test data with the training data's mean and standard deviation
    

    (Note: you can fit and transform your training data in one step using fit_transform().)
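
On the question of applying the formula by hand: you don't have to, because the fitted scaler already stores the training statistics, but it may help to see what that amounts to. Below is a minimal standard-library sketch, assuming a single feature and made-up numbers (`training_data`, `prediction_data` and the `scale` helper are hypothetical names used only for illustration). It uses the population standard deviation, which is what StandardScaler uses.

```python
from statistics import fmean, pstdev

# Hypothetical single-feature data, made up for illustration.
training_data = [2.0, 4.0, 6.0, 8.0]
prediction_data = [5.0, 10.0]

# "Fit": learn the statistics from the training data only.
train_mean = fmean(training_data)
train_std = pstdev(training_data)  # population standard deviation (ddof=0)


def scale(values, mean, std):
    """Apply z = (x - mean) / std using the statistics passed in."""
    return [(x - mean) / std for x in values]


# "Transform": both data sets are scaled with the training statistics.
scaled_training = scale(training_data, train_mean, train_std)
scaled_prediction = scale(prediction_data, train_mean, train_std)

# For contrast: scaling the prediction data with its own statistics
# maps the same raw values to different numbers.
separately_scaled = scale(
    prediction_data, fmean(prediction_data), pstdev(prediction_data)
)

print("scaled with training statistics:", scaled_prediction)
print("scaled with its own statistics: ", separately_scaled)
```

The two printed lists differ, which is why the prediction data should go through the scaler that was fitted on the training data rather than through a second, separately fitted one.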