I have a nice pipeline that does the following:
pipeline = Pipeline([
    ("first transformer", ct),
    ("second transformer", OHE),
    ('standard_scaler', MinMaxScaler()),
    ("logistic regression", estimator)
])
The estimator part is this:
estimator = MultiOutputClassifier(
    estimator=LogisticRegression(penalty="l2", C=2)
)
The label DataFrame has shape (1000, 2), and everything works so far.
To tweak the model, I am now trying to add SelectKBest to limit the number of features the model uses. Unfortunately, adding this step to the pipeline:
('feature_selection', SelectKBest(score_func=f_regression, k=9))
returns this error:
ValueError: y should be a 1d array, got an array of shape (20030, 2) instead.
I understand where it comes from, and using only one label (shape (1000, 1)) solves the issue, but that means I would need to create a separate pipeline for each label.
Is there any way of including feature selection in this pipeline without resorting to that?
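Since you (potentially) want to use a different subset of features for each output, you should just put the SelectKBest in a pipeline with the LogisticRegression, inside the MultiOutputClassifier. MultiOutputClassifier fits one clone of its estimator per label column, so each SelectKBest only ever sees a 1d y.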
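from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Inner pipeline: cloned and fitted once per label column, each time with a 1d y
clf = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_regression, k=9)),
    ("logistic regression", LogisticRegression(penalty="l2", C=2)),
])
estimator = MultiOutputClassifier(clf)

# Outer pipeline: ct and OHE are the transformers from the question
pipeline = Pipeline([
    ("first transformer", ct),
    ("second transformer", OHE),
    ('standard_scaler', MinMaxScaler()),
    ("select_and_model", estimator),
])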
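To check which features end up selected for each label, here is a minimal self-contained sketch of the same nesting. It makes several assumptions that are not in the question: the data is synthetic (make_classification plus a made-up second label), the ct and OHE steps are left out because their definitions are not shown, and f_classif is swapped in for f_regression since the targets here are class labels. The penalty argument is omitted because l2 is already the default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (assumption): 1000 rows, 20 numeric features.
X, y0 = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
# Hypothetical second label, derived from two feature columns for illustration only.
y1 = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.column_stack([y0, y1])  # shape (1000, 2), like the label DataFrame

# Feature selection lives inside the per-label estimator.
clf = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif, k=9)),
    ("logistic_regression", LogisticRegression(C=2)),
])

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("select_and_model", MultiOutputClassifier(clf)),
])
pipeline.fit(X, Y)

# estimators_ holds one fitted copy of clf per label column.
selected = [
    est.named_steps["feature_selection"].get_support(indices=True)
    for est in pipeline.named_steps["select_and_model"].estimators_
]
for label_idx, cols in enumerate(selected):
    print(f"label {label_idx}: selected feature indices {cols.tolist()}")

predictions = pipeline.predict(X)  # one column of predictions per label
```

Each entry of selected is the array of column indices (as seen after scaling) that the corresponding label's SelectKBest kept, so the two labels are free to use different subsets.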