python | scikit-learn | pipeline | gridsearchcv

Are the pipeline steps followed when predicting after GridSearchCV?


I'm using GridSearchCV with a pipeline whose first step is standardization. I've found that when predicting on a test dataset with GridSearchCV's .predict method, the results differ from implementing the pipeline steps manually. I've created a simplified version of my script below to show that the resulting errors differ. For simplicity, the search space consists of only one value per parameter.

I'm aware that the difference here is very small, but in my original code it is significantly larger, so I'm trying to understand what causes the difference between the two methods.

Initializing the data

import random

import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

random.seed(5)
np.random.seed(5)

x_train = np.random.rand(1000,4)
y_train = np.random.rand(1000,1)

x_test = np.random.rand(100,4)
y_test = np.random.rand(100,1)

C_space = 1
epsilon_space = 0.04
gamma_space = 0.001

Implementing a Pipeline

cv_folds = KFold(n_splits=5, shuffle=True)

steps = [
    ('scaler', StandardScaler()), 
    ('svr', SVR(kernel='rbf', gamma=gamma_space, C=C_space, epsilon=epsilon_space))
    ]
pipe = Pipeline(steps, verbose=0)

search_space = [{
        'svr__gamma':[gamma_space],
        'svr__C':[C_space], 
        'svr__epsilon':[epsilon_space]
        }]

mod = GridSearchCV(pipe, search_space,
    scoring='neg_mean_absolute_error', cv=5, verbose=0, return_train_score=True, refit=True)
svr = mod.fit(x_train, y_train)
y_pred = svr.predict(x_test)

error = mean_absolute_error(y_pred, y_test)

Implementing the steps manually

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.fit_transform(x_test)

manual_svr = SVR(kernel='rbf', gamma=gamma_space, C=C_space,
    epsilon=epsilon_space).fit(x_train_scaled, y_train)

y_pred_manual = manual_svr.predict(x_test_scaled)

error_manual = mean_absolute_error(y_pred_manual, y_test)

The outcome is:

Pipeline error is: 0.23495746730222067
Manual error is: 0.23487379770774958

Solution

  • You are fitting the StandardScaler within GridSearchCV to the training data, whereas you are re-fitting your "manual" scaler to the test data. With

    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.fit_transform(x_test)
    

    you are overwriting the scaler fitted to the training data!
    This is not how a scaler is meant to be used: fit the scaler to the training data only, then use that same fitted scaler to standardize your test data.
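    To see why this matters, here is a minimal sketch (with freshly generated random data, not the arrays from the question) showing that a scaler fitted on the test set learns different statistics than one fitted on the training set, so the two transforms of the test data disagree:

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    x_tr = rng.random((1000, 4))
    x_te = rng.random((100, 4))

    # Correct: fit once, on the training data only.
    sc = StandardScaler().fit(x_tr)

    # Wrong: a second scaler fitted on the test data learns its own
    # mean_ and scale_, which differ from the training statistics.
    sc_wrong = StandardScaler().fit(x_te)

    print(np.allclose(sc.mean_, sc_wrong.mean_))  # almost surely False
    ```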

    Let's compare your output with how it should look. First, let's extract the scaler fitted inside GridSearchCV and standardize the test data with it:

    gscv_sclr = mod.best_estimator_.named_steps['scaler']
    gscv_test_scld = gscv_sclr.transform(x_test)
    

    As you can see, this is not equal to your manually standardized test data:

    np.allclose(gscv_test_scld, x_test_scaled)
    # Out: False
    

    Now let's fit the "manual" standardizer only with the training data and use this standardizer to transform your test data:

    scaler_new = StandardScaler()
    x_train_scaled = scaler_new.fit_transform(x_train)
    x_test_scaled_new = scaler_new.transform(x_test)
    
    # and compare it to the gridsearchcv scaler:
    np.allclose(gscv_test_scld, x_test_scaled_new)
    # Out: True
    

    They match!
    Now use this correctly standardized test set to make your predictions:

    # Refitting is not strictly needed; it is only done here to keep the models separate.
    manual_svr_new = SVR(kernel='rbf', gamma=gamma_space, C=C_space,
        epsilon=epsilon_space).fit(x_train_scaled, y_train)
    
    y_pred_manual_new = manual_svr_new.predict(x_test_scaled_new)
    
    error_manual_new = mean_absolute_error(y_pred_manual_new, y_test)
    
    # And test it:
    error_manual_new == error
    # Out: True
    

    And now you've got the result of your pipeline.
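    The easiest way to avoid this class of bug entirely is to let the pipeline own the scaler: Pipeline.fit calls fit_transform on the training data and Pipeline.predict calls transform (never fit_transform) on new data. A minimal self-contained sketch, using the same hyperparameter values as the question:

    ```python
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    rng = np.random.default_rng(5)
    X_tr, y_tr = rng.random((1000, 4)), rng.random(1000)
    X_te = rng.random((100, 4))

    # fit() learns the scaling statistics on X_tr only;
    # predict() reuses them on X_te via transform().
    pipe = make_pipeline(
        StandardScaler(),
        SVR(kernel='rbf', gamma=0.001, C=1, epsilon=0.04),
    )
    pipe.fit(X_tr, y_tr)
    preds = pipe.predict(X_te)
    print(preds.shape)  # (100,)
    ```

    With this pattern there is no separate scaler object to misuse, and predictions on any new data are standardized with the training statistics automatically.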