
Using different features for the same estimator in the pipeline


I have a nice pipeline that does the following:

pipeline = Pipeline([
    ("first transformer", ct),
    ("second transformer", OHE),
    ('standard_scaler', MinMaxScaler()),
    ("logistic regression", estimator)
])

The estimator part is this:

estimator = MultiOutputClassifier(
    estimator = LogisticRegression(penalty="l2", C=2)
)

The label DataFrame has shape (1000, 2), and everything works so far.

To tune the model, I tried adding SelectKBest to limit the number of features used. Unfortunately, adding this step to the pipeline:

('feature_selection', SelectKBest(score_func=f_regression, k=9))

returns this error:

ValueError: y should be a 1d array, got an array of shape (20030, 2) instead.

I understand where it comes from: SelectKBest expects a 1-D y. Using only one label, of shape (1000, 1), solves the issue, but that means I would need to create a separate pipeline for each label.

Is there any way of including feature selection in this pipeline without resorting to that?


Solution

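  • SelectKBest scores features against a single target, so its score function needs a 1-D y. That is why it fails as a step of the outer pipeline, where y has two columns. MultiOutputClassifier, on the other hand, fits a separate clone of its estimator on each label column, so each clone sees a 1-D y. Since you (potentially) want a different subset of features for each output, put SelectKBest in a pipeline with the LogisticRegression and wrap that pipeline in the MultiOutputClassifier.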

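    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    # Inner pipeline: fitted once per label column, so SelectKBest gets a 1-D y.
    clf = Pipeline([
        ("feature_selection", SelectKBest(score_func=f_regression, k=9)),
        ("logistic regression", LogisticRegression(penalty="l2", C=2)),
    ])
    estimator = MultiOutputClassifier(clf)

    # Outer pipeline: ct and OHE are the transformers defined in the question.
    pipeline = Pipeline([
        ("first transformer", ct),
        ("second transformer", OHE),
        ("standard_scaler", MinMaxScaler()),
        ("select_and_model", estimator),
    ])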
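
To check that each output really gets its own feature subset, the fitted per-label pipelines can be inspected through the `estimators_` attribute of MultiOutputClassifier. The following is only a minimal sketch under several assumptions: the data is synthetic (random features, two made-up binary labels), the `ct` and `OHE` steps from the question are left out because their definitions are not shown, and `f_classif` is swapped in for `f_regression` since the targets here are class labels. The step names and the variables `masks` and `preds` are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (assumption): 1000 rows, 20 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
# Two binary labels that depend on different columns.
Y = np.column_stack([
    (X[:, 0] + X[:, 1] > 0).astype(int),
    (X[:, 5] - X[:, 6] > 0).astype(int),
])

# Inner pipeline: cloned and fitted once per label column.
clf = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif, k=9)),
    ("logistic_regression", LogisticRegression(C=2)),  # L2 penalty is the default
])
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("select_and_model", MultiOutputClassifier(clf)),
])
pipeline.fit(X, Y)
preds = pipeline.predict(X)  # one prediction column per label

# One boolean mask per output, marking the columns that output's selector kept.
masks = [
    est.named_steps["feature_selection"].get_support()
    for est in pipeline.named_steps["select_and_model"].estimators_
]
for i, mask in enumerate(masks):
    print(f"output {i}: selected columns {np.flatnonzero(mask).tolist()}")
```

Each entry of `estimators_` is a fitted copy of the inner pipeline, so the same pattern gives access to the per-output scores (`scores_`) or the logistic regression coefficients if those are needed.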