python, pandas, scikit-learn, pipeline

Is there a way to chain a pd.cut FunctionTransformer in a sklearn Pipeline?


I'm working on making a DataFrame pre-processing pipeline using sklearn and chaining various types of pre-processing steps.

I wanted to chain a SimpleImputer transformer and a FunctionTransformer applying a pd.qcut (or pd.cut) but I keep getting the following error:

ValueError: Input array must be 1 dimensional

Here's my code:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

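class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        # Store under the same name as the __init__ argument so that
        # get_params() and clone() can find it.
        self.features = features

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Selecting with a list keeps the result a 2D DataFrame.
        return X[self.features]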

fare_transformer = Pipeline([
    ('fare_selector', FeatureSelector(['Fare'])),
    ('fare_imputer', SimpleImputer(strategy='median')),
    ('fare_bands', FunctionTransformer(func=pd.qcut, kw_args={'q': 5}))
])

The same happens if I simply chain the FeatureSelector transformer and the FunctionTransformer with pd.qcut and omit the SimpleImputer:

fare_transformer = Pipeline([
    ('fare_selector', FeatureSelector(['Fare'])),
    ('fare_bands', FunctionTransformer(func=pd.qcut, kw_args={'q': 5}))
])

I searched Stack Overflow and Google extensively but could not find a solution to this issue. Any help here would be greatly appreciated!


Solution

  • sklearn already has such a transformer, KBinsDiscretizer (to match pd.qcut, use strategy='quantile'). The main difference is in how test data is transformed: the FunctionTransformer version recomputes ("refits") the quantiles on whatever data it is given, whereas the built-in KBinsDiscretizer stores the bin edges learned during fit and reuses them to bin test data. As @m_power notes in a comment, the two also differ near bin edges and in the format of the transformed data.

    But to address the error specifically: pd.qcut accepts only 1-dimensional input, whereas FunctionTransformer passes the entire DataFrame to the function. You can define a thin wrapper around qcut to make this work, like

    def frame_qcut(X, y=None, q=10):
        # Apply pd.qcut to each column separately, since it only takes 1D input.
        return X.apply(pd.qcut, axis=0, q=q)
    

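    (That assumes the function receives a DataFrame. After a SimpleImputer step, which returns a NumPy array by default, X.apply would not be available.)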
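
To make the two options above concrete, here is a minimal sketch that runs both after a SimpleImputer step. The toy Fare values, the variable names (qcut_pipeline, kbins_pipeline) and the labels=False choice are assumptions for illustration, not part of the original question. The wrapper rebuilds a DataFrame from whatever it receives, so it should also cope with the NumPy array that SimpleImputer returns by default.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer

# Toy data (made up for illustration): one 'Fare' column with a missing value.
df = pd.DataFrame({"Fare": [7.25, 71.28, 7.93, 53.10, 8.05,
                            np.nan, 51.86, 21.08, 11.13, 30.07]})


def frame_qcut(X, q=5):
    """Apply pd.qcut column by column; accepts a DataFrame or a 2D array."""
    # SimpleImputer returns a NumPy array by default, so rebuild a DataFrame.
    frame = pd.DataFrame(X)
    # labels=False yields integer bin codes rather than Interval categories.
    return frame.apply(pd.qcut, axis=0, q=q, labels=False)


# Option 1: stateless binning; quantiles are recomputed on every transform call.
qcut_pipeline = Pipeline([
    ("fare_imputer", SimpleImputer(strategy="median")),
    ("fare_bands", FunctionTransformer(func=frame_qcut, kw_args={"q": 5})),
])
qcut_bands = qcut_pipeline.fit_transform(df[["Fare"]])

# Option 2: KBinsDiscretizer learns the bin edges in fit and reuses them later.
kbins_pipeline = Pipeline([
    ("fare_imputer", SimpleImputer(strategy="median")),
    ("fare_bands", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")),
])
kbins_bands = kbins_pipeline.fit_transform(df[["Fare"]])

# The learned edges are stored on the fitted step.
print(kbins_pipeline.named_steps["fare_bands"].bin_edges_)
```

Both pipelines return one column of bin codes from 0 to 4, but only the second keeps its edges for later calls to transform. Depending on the scikit-learn version, KBinsDiscretizer may emit a FutureWarning about changing defaults; that does not affect the sketch.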