
How to specify the parameter for FeatureUnion to let it pass to underlying transformer


In my code, I am trying to pass sample_weight to the StandardScaler. However, this StandardScaler sits inside a Pipeline, which in turn sits inside a FeatureUnion. I can't seem to get the parameter name right: I am passing scaler_pipeline__scaler__sample_weight to the fit_transform method of the preprocessor object.

I get the following error: KeyError: 'scaler_pipeline'

What should this parameter name be? Alternatively, if there is a generally better way to do this, feel free to propose it.

The code below is a standalone example.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
import pandas as pd

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select only specified columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

    def set_output(self, *, transform=None):
        return self

df = pd.DataFrame({'ds':[1,2,3,4],'y':[1,2,3,4],'a':[1,2,3,4],'b':[1,2,3,4],'c':[1,2,3,4]})
sample_weight=[0,1,1,1]

scaler_pipeline = Pipeline(
    [
        (
            "selector",
            ColumnSelector(['a','b']),
        ),
        ("scaler", StandardScaler()),
    ]
)

remaining_pipeline = Pipeline([("selector", ColumnSelector(["ds","y"]))])

# FeatureUnion fitted on the training data
preprocessor = FeatureUnion(
    transformer_list=[
        ("scaler_pipeline", scaler_pipeline),
        ("remaining_pipeline", remaining_pipeline),
    ]
).set_output(transform="pandas")

df_training_transformed = preprocessor.fit_transform(
    df, scaler_pipeline__scaler__sample_weight=sample_weight
)


Solution

  • fit_transform has no parameter called scaler_pipeline__scaler__sample_weight.

    Instead, it is expecting to receive "parameters passed to the fit method of each step" as a dict of string, "where each parameter name is prefixed such that parameter p for step s has key s__p".

    So, in your example, it should be:

    df_training_transformed = preprocessor.fit_transform(
        df, {"scaler_pipeline__scaler__sample_weight":sample_weight}
    )
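
Note that the second positional argument of fit_transform is y, so it is worth verifying that the weights actually reach the scaler. The sketch below is one way to check this, not a definitive recipe: it fits the inner scaler pipeline on its own, using the s__p key form quoted above (scaler__sample_weight), and compares the fitted mean_ with a weighted mean computed by hand. The names expected, unweighted, weights_applied and scaled are introduced here for illustration only, the example data frame is a shortened copy of the one in the question, and reassembling the output with pd.concat is an assumption about what the final frame should look like.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select only specified columns (same idea as in the question)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


df = pd.DataFrame(
    {"ds": [1, 2, 3, 4], "y": [1, 2, 3, 4], "a": [1, 2, 3, 4], "b": [1, 2, 3, 4]}
)
sample_weight = [0, 1, 1, 1]

scaler_pipeline = Pipeline(
    [("selector", ColumnSelector(["a", "b"])), ("scaler", StandardScaler())]
)

# Inside a Pipeline, the keyword is <step name>__<fit parameter>.
scaler_pipeline.fit(df, scaler__sample_weight=sample_weight)
weighted_mean = scaler_pipeline.named_steps["scaler"].mean_

# Reference values computed by hand: with weights [0, 1, 1, 1] the first row
# is ignored, so the weighted mean of [1, 2, 3, 4] is 3.0 instead of 2.5.
values = df[["a", "b"]].to_numpy()
expected = np.average(values, axis=0, weights=sample_weight)
unweighted = values.mean(axis=0)

# True only if the scaler really used the weights during fit.
weights_applied = np.allclose(weighted_mean, expected)

# Hypothetical reassembly: scaled columns next to the untouched ones.
scaled = pd.DataFrame(
    scaler_pipeline.transform(df), columns=["a", "b"], index=df.index
)
df_training_transformed = pd.concat([scaled, df[["ds", "y"]]], axis=1)
```

The same comparison can be run against preprocessor after calling fit_transform: if the mean_ of the nested scaler equals the unweighted mean, the weights were not applied.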