python scikit-learn pipeline training-data

Using ColumnTransformer in scikit-learn to preprocess train and test data with a target variable


I am having trouble preprocessing my whole dataset with ColumnTransformer. Maybe you can help:

First I split my dataset into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=seed)

Then I do my preprocessing:

preprocessor = ColumnTransformer(
    transformers=[
        ("col_drop", "drop", ["col1", "col2"]),
        ("enc_1", BinaryEncoder(), ["Bank"]),
        ("enc_2", OneHotEncoder(), ["Chair"]),
        ("log", FunctionTransformer(np.log1p, validate=False), log_features),
        # target_y is the target column
        ("log_p", FunctionTransformer(np.log1p, validate=False), ["target_y"]),
        ("pow", PowerTransformer(method="yeo-johnson"), pow_features),
    ],
    remainder="passthrough",
    n_jobs=-1,
)

And after that I call a pipeline with my preprocessor:

pipe.fit_transform(X_train, y_train)

This produces the error: A given column is not a column of the dataframe

In a way this makes sense: I use the preprocessor to apply np.log1p to target_y, which is my target and is only present in y_train and y_test. I assume this causes the error, because the target is not a column of X_train.

Question: Is it possible to preprocess X and y at once, or do I have to use a separate ColumnTransformer/pipeline for my y values? Is there a good solution for this?


Solution

  • You cannot preprocess targets in a ColumnTransformer or Pipeline, because their transformers only operate on X (unless you put the targets together with the independent variables and then split them out again later). However, TransformedTargetRegressor (in sklearn.compose) is meant for exactly this use case.
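
A minimal sketch of how this could look, under some assumptions: the toy DataFrame, the "income" column, and LinearRegression as the estimator are made up for illustration and are not from the question, and BinaryEncoder and PowerTransformer are left out to keep the example short. The feature preprocessing stays in the ColumnTransformer, while the log1p on the target moves into TransformedTargetRegressor via func and inverse_func:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical toy data; "income" is an illustrative column name.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Chair": rng.choice(["a", "b", "c"], size=n),
    "income": rng.lognormal(mean=3.0, sigma=1.0, size=n),
})
# Non-negative target, kept separate from the features.
target = pd.Series(2.0 * df["income"] + rng.uniform(0.0, 1.0, size=n), name="target_y")

X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.2, random_state=42
)

# The ColumnTransformer only references columns that exist in X.
preprocessor = ColumnTransformer(
    transformers=[
        ("enc", OneHotEncoder(handle_unknown="ignore"), ["Chair"]),
        ("log", FunctionTransformer(np.log1p), ["income"]),
    ],
    remainder="passthrough",
)

pipe = Pipeline([("prep", preprocessor), ("reg", LinearRegression())])

# The target transformation lives here: y is passed through log1p before
# fitting, and predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=pipe, func=np.log1p, inverse_func=np.expm1
)
model.fit(X_train, y_train)
pred = model.predict(X_test)  # predictions on the original target scale
```

Because the inverse transform is applied inside predict, metrics computed on pred and y_test are on the original scale of the target, and no second pipeline for y is needed.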