Tags: scikit-learn, data-science, pipeline, xgboost, data-preprocessing

Pipeline with XGBoost - Imputer and Scaler prevent Model from learning


I'm trying to build a pipeline for data preprocessing for my XGBoost model. The data contains NaNs and needs to be scaled. This is the relevant code:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
import xgboost

xgb_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', preprocessing.StandardScaler()),
    ('regressor', xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective="reg:squarederror"))])

xgb_pipe.fit(train_x.values, train_y.values,
             regressor__early_stopping_rounds=20,
             regressor__eval_metric="rmse",
             regressor__eval_set=[[train_x.values, train_y.values], [test_x.values, test_y.values]])

The loss immediately increases and the training stops after 20 iterations.

If I remove the imputer and the scaler from the pipeline, it works and trains for the full 100 iterations. If I manually preprocess the data it also works as intended, so I know that the problem is not the data. What am I missing?


Solution

  • The problem is that the preprocessing doesn't get applied to your eval sets: the model sees raw, unimputed, unscaled data there, performs quite badly on it, and early stopping kicks in very early.

    I'm not sure there's a simple way to do this that keeps everything in one pipeline, unfortunately. The preprocessing steps of the pipeline need to be applied to the eval sets, so they have to be fitted before you can build the eval_set parameter.

    Separate preprocessing

    As two objects it's no problem:

    preproc = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', preprocessing.StandardScaler()),
    ])
    
    reg = xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective="reg:squarederror")
    
    train_x_preproc = preproc.fit_transform(train_x.values, train_y.values)
    test_x_preproc = preproc.transform(test_x.values)
    
    reg.fit(train_x_preproc, train_y.values,
        early_stopping_rounds=20,
        eval_metric="rmse",
        eval_set=[(train_x_preproc, train_y.values), (test_x_preproc, test_y.values)],
    )
    

    After fitting you could put these now-fitted estimators together into a pipeline (pipelines don't clone their estimators) for prediction if you'd like.
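    For instance, something like this (a sketch; full_pipe is just an illustrative name):

    # Pipeline doesn't clone or refit its estimators, so the fitted state
    # of the preprocessing steps and the regressor is preserved.
    full_pipe = Pipeline(steps=preproc.steps + [('regressor', reg)])
    preds = full_pipe.predict(test_x.values)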

    Custom estimator

    There are a lot of ways to go about this, but inheriting from Pipeline means you can initialize it the same way as your current setup: we just assume the last step is an xgboost model and the rest are preprocessing steps that need to be applied to the eval sets as well as to the fitting and predicting sets. I think everything else can be left to the methods inherited from Pipeline.

    class PreprocEarlyStoppingXGB(Pipeline):
        def fit(self, X, y, eval_set, **fit_params):
            # everything but the last step is preprocessing; Pipeline reuses
            # the estimator instances, so fitting this sub-pipeline also
            # fits the corresponding steps of self
            preproc = Pipeline(self.steps[:-1])
            X_preproc = preproc.fit_transform(X, y)
            # apply the fitted preprocessing to every eval set
            eval_set_preproc = [(preproc.transform(eval_X), eval_y)
                                for eval_X, eval_y in eval_set]
            # steps[-1] is a (name, estimator) tuple; fit the regressor,
            # forwarding any remaining fit params (e.g. early_stopping_rounds)
            self.steps[-1][1].fit(X_preproc, y, eval_set=eval_set_preproc, **fit_params)
            return self
    
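    You can then initialize and fit it just like your original pipeline, except the fit params lose their regressor__ prefix. A sketch reusing the names from the question (note that newer xgboost versions take early_stopping_rounds and eval_metric in the constructor rather than in fit):

    xgb_pipe = PreprocEarlyStoppingXGB(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', preprocessing.StandardScaler()),
        ('regressor', xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective="reg:squarederror"))])
    
    # eval_set and the other params are forwarded by our custom fit
    xgb_pipe.fit(train_x.values, train_y.values,
                 eval_set=[(train_x.values, train_y.values), (test_x.values, test_y.values)],
                 eval_metric="rmse",
                 early_stopping_rounds=20)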

    As for your use case from the comments, what happens when you cross-validate with this object? On each training fold, the preprocessing steps are fitted. They are then applied to that training fold, to all eval sets (the entire training set as well as the external test set), and finally to the test fold when scoring. The xgboost model trains on the preprocessed training fold and watches the score on the entire training set and the external test set (both preprocessed), the latter being used for early stopping. A minimal sketch of that is below.
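
    This assumes scikit-learn's fit-parameter routing via the fit_params argument (renamed params in scikit-learn >= 1.4):

    from sklearn.model_selection import cross_val_score
    
    scores = cross_val_score(
        xgb_pipe, train_x.values, train_y.values,
        cv=5,
        # forwarded to PreprocEarlyStoppingXGB.fit on each training fold
        fit_params={'eval_set': [(train_x.values, train_y.values),
                                 (test_x.values, test_y.values)],
                    'eval_metric': 'rmse',
                    'early_stopping_rounds': 20},
    )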