
Scaling Out-of-Sample Forecasts in a Model with Normalized Variables: Reverting to Original Scale


I'm building forecasts from a model whose variables were standardized as $ z_i = \frac{x_i - \text{mean}(x_i)}{\text{sd}(x_i)} $, and I've saved each mean and standard deviation. Now, for out-of-sample forecasts of the target variable $ x_i $ produced by the scaled model, how do I scale the forecasts back?

Should I use the in-sample $ \text{mean}(x_i) $ and $ \text{sd}(x_i) $ to scale the out-of-sample forecasts back, so that:

$ \text{Re-scaled out-of-sample forecast} = \text{Scaled forecast} \times \text{sd}(x_i) + \text{mean}(x_i) $

What's the appropriate procedure here?

Python example:

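import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 1) * 10 + 50  # Feature
y = 2 * X + 1 + np.random.randn(100, 1) * 5  # Target variable
# First 80 rows are in-sample, last 20 are out-of-sample
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit the scalers on the training data only
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
y_train_scaled = scaler_y.fit_transform(y_train)

# Train on the scaled feature and the scaled target
model = LinearRegression()
model.fit(X_train_scaled, y_train_scaled)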

Solution

  • Yes. Use the in-sample mean and standard deviation to rescale the forecasts back to the original scale, for the following reasons:

    • Consistency: The model was trained on data scaled with these parameters, so its predictions are expressed in those scaled units. Inverting with the same parameters returns them to the original units.
    • Avoiding data leakage: The out-of-sample values of the target are not known when the forecast is made, so their statistics cannot legitimately be used. Rescaling with them would introduce information that wasn't available during model training, which could lead to biased results.
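As a rough check on the consistency argument, here is a minimal pure-Python sketch. It assumes ordinary least squares with an intercept; the toy numbers and the helper `ols` are hypothetical and not part of the question. For that model, fitting on standardized data and rescaling the forecast with the training mean and sd should give the same number as fitting on the raw data.

```python
from statistics import fmean, pstdev


def ols(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical in-sample data and one out-of-sample feature value
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [3.1, 4.9, 7.2, 8.8, 11.0]
x_new = 7.0

# In-sample statistics, saved at training time
mean_x, sd_x = fmean(x_train), pstdev(x_train)
mean_y, sd_y = fmean(y_train), pstdev(y_train)

# Fit on the standardized training data
zx = [(x - mean_x) / sd_x for x in x_train]
zy = [(y - mean_y) / sd_y for y in y_train]
a_scaled, b_scaled = ols(zx, zy)

# Scale the new input with the training stats, predict, then rescale back
z_new = (x_new - mean_x) / sd_x
forecast_scaled = a_scaled + b_scaled * z_new
forecast = forecast_scaled * sd_y + mean_y

# Reference: the same model fitted directly on the raw data
a_raw, b_raw = ols(x_train, y_train)
forecast_raw = a_raw + b_raw * x_new

print(abs(forecast - forecast_raw) < 1e-9)
```

The two forecasts agree up to floating-point error only because the same mean and sd are used in both directions. Other model types need not match a raw-scale fit exactly, but the inverse step is the same.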

    Rescale the predictions:

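    # Scale the test features with the scaler fitted on the training data
    X_test_scaled = scaler_X.transform(X_test)
    y_pred_scaled = model.predict(X_test_scaled)
    # Map the predictions back to the original units of y
    y_pred = scaler_y.inverse_transform(y_pred_scaled)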

    Manual rescaling (equivalent to inverse_transform):

    y_pred_manual = y_pred_scaled * scaler_y.scale_ + scaler_y.mean_
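To check that the two rescaling routes agree, the snippets can be combined into one script. This is a sketch, not a definitive implementation: the fixed seed and the use of `np.random.default_rng` are assumptions added for reproducibility, and the RMSE line is only an illustration of evaluating in original units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
X = rng.normal(50, 10, size=(100, 1))            # feature
y = 2 * X + 1 + rng.normal(0, 5, size=(100, 1))  # target
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit both scalers on the training data only
scaler_X = StandardScaler().fit(X_train)
scaler_y = StandardScaler().fit(y_train)

model = LinearRegression()
model.fit(scaler_X.transform(X_train), scaler_y.transform(y_train))

# Forecast in scaled units, then return to original units two ways
y_pred_scaled = model.predict(scaler_X.transform(X_test))
y_pred = scaler_y.inverse_transform(y_pred_scaled)
y_pred_manual = y_pred_scaled * scaler_y.scale_ + scaler_y.mean_

same = np.allclose(y_pred, y_pred_manual)
# StandardScaler stores the population sd (ddof=0) in scale_
sd_matches = np.allclose(scaler_y.scale_, y_train.std(axis=0, ddof=0))
# Error measured in the original units of y
rmse = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
print(same, sd_matches, rmse)
```

If you saved the mean and sd yourself rather than keeping the scaler object, rescale with exactly the sd that was used for scaling. `StandardScaler` uses the population sd (`ddof=0`), while pandas `.std()` defaults to the sample sd (`ddof=1`).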