I am having trouble preprocessing my dataset as a whole with ColumnTransformer. Maybe you can help.
First I split my dataset:
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=seed)
Then I do my preprocessing:
preprocessor = ColumnTransformer(
    transformers=[
        ("col_drop", "drop", ["col1", "col2"]),
        ("enc_1", BinaryEncoder(), ["Bank"]),
        ("enc_2", OneHotEncoder(), ["Chair"]),
        ("log", FunctionTransformer(np.log1p, validate=False), log_features),
        ("log_p", FunctionTransformer(np.log1p, validate=False), ["target_y"]),
        ("pow", PowerTransformer(method="yeo-johnson"), pow_features),
    ],
    remainder="passthrough",
    n_jobs=-1,
)
And after that I call a pipeline with my preprocessor:
pipe.fit_transform(X_train, y_train)
This produces the error: A given column is not a column of the dataframe
In a way this makes sense, because the preprocessor applies log1p to target_y, which is my target and is only present in y_train and y_test. I assume this causes the error, because the target is not in X_train.
Question: Is it possible to preprocess X and y at once, or do I have to use a separate ColumnTransformer/pipeline for my y values? Is there a good solution for this?
You cannot preprocess targets in a ColumnTransformer or Pipeline (unless you plan on putting them together with the independent variables and then splitting them out later); however, there is the TransformedTargetRegressor (docs), which is meant for this use case.
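As a rough sketch of how that could look: the feature transformations stay in the ColumnTransformer, and the log1p on the target moves into a TransformedTargetRegressor wrapped around the pipeline. The toy data, the "income" column, and the LinearRegression model are placeholders I made up for illustration, not taken from the question; swap in your own columns and estimator.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Toy data (assumption): "income" and the target values are invented here.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Chair": rng.choice(["a", "b", "c"], size=100),
    "income": rng.uniform(1.0, 100.0, size=100),
})
target = pd.Series(
    np.exp(0.02 * df["income"] + rng.normal(0.0, 0.1, size=100)),
    name="target_y",
)

X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.2, random_state=0
)

# Feature preprocessing only: the target column is NOT listed here.
preprocessor = ColumnTransformer(
    transformers=[
        ("enc", OneHotEncoder(handle_unknown="ignore"), ["Chair"]),
        ("log", FunctionTransformer(np.log1p), ["income"]),
    ],
    remainder="passthrough",
)

# Placeholder model (assumption); any regressor can go in the last step.
pipe = Pipeline([("prep", preprocessor), ("reg", LinearRegression())])

# The target transform lives here: log1p is applied to y before fitting,
# and expm1 is applied to the predictions to map them back.
model = TransformedTargetRegressor(
    regressor=pipe,
    func=np.log1p,
    inverse_func=np.expm1,
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)  # already on the original target scale
```

Note that you call fit and predict on the wrapper rather than fit_transform on the pipeline, and that predictions come back on the original scale, so they can be compared with y_test directly.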