I'm working on making a DataFrame pre-processing pipeline using sklearn and chaining various types of pre-processing steps.
I wanted to chain a SimpleImputer transformer and a FunctionTransformer applying pd.qcut (or pd.cut), but I keep getting the following error:
ValueError: Input array must be 1 dimensional
Here's my code:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self._features = features

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Returns a DataFrame containing only the selected columns
        return X[self._features]

fare_transformer = Pipeline([
    ('fare_selector', FeatureSelector(['Fare'])),
    ('fare_imputer', SimpleImputer(strategy='median')),
    ('fare_bands', FunctionTransformer(func=pd.qcut, kw_args={'q': 5}))
])
The same happens if I simply chain the FeatureSelector transformer and the FunctionTransformer with pd.qcut and omit the SimpleImputer:
fare_transformer = Pipeline([
    ('fare_selector', FeatureSelector(['Fare'])),
    ('fare_bands', FunctionTransformer(func=pd.qcut, kw_args={'q': 5}))
])
I searched Stack Overflow and Google extensively but could not find a solution to this issue. Any help here would be greatly appreciated!
sklearn already has such a transformer, KBinsDiscretizer (to match pd.qcut, use strategy='quantile'). It differs primarily in how it transforms test data: the FunctionTransformer version will "refit" the quantiles on whatever data it is given, whereas the builtin KBinsDiscretizer saves the quantile statistics learned during fit and reuses them to bin test data. As @m_power notes in a comment, the two also differ near bin edges, as well as in the format of the transformed data.
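A minimal sketch of the KBinsDiscretizer route, to make the fit/transform difference concrete. The toy Fare values and the variable names are made up for illustration, and encode='ordinal' is my choice here so the output is one integer bin code per row rather than the default one-hot encoding:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy data standing in for the Fare column.
train = pd.DataFrame({"Fare": [5.0, 7.5, 8.0, 10.0, 13.0, 20.0, 26.0, 31.0, 52.0, 80.0]})
test = pd.DataFrame({"Fare": [6.0, 30.0, 500.0]})

# strategy='quantile' mirrors pd.qcut; encode='ordinal' returns integer bin codes.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

# Bin edges are learned from the training data only...
train_bins = binner.fit_transform(train)

# ...and reused unchanged for the test data. Values outside the training
# range fall into the first or last bin instead of triggering a refit.
test_bins = binner.transform(test)

print(binner.bin_edges_[0])  # the stored edges for the single Fare column
print(test_bins.ravel())
```

Depending on the installed sklearn version, this may emit a FutureWarning about changing defaults for quantile computation; the binning itself still runs.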
But to address the error specifically: it means that pd.qcut only accepts 1D input, whereas FunctionTransformer passes it the entire dataframe (2D, even with a single column). You can define a thin wrapper around qcut to make this work, like
def frame_qcut(X, y=None, q=10):
    return X.apply(pd.qcut, axis=0, q=q)
(That's assuming you'll get a dataframe in.)
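If the SimpleImputer step stays in the pipeline, the wrapper will receive a NumPy array rather than a dataframe, since SimpleImputer returns an array by default. Here is a sketch of a variant that copes with both; array_qcut is a hypothetical name, the sample data is made up, and labels=False is my own addition so the result is integer bin codes rather than Interval categories:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def array_qcut(X, q=10):
    # Hypothetical helper: wrap the input in a DataFrame so that arrays
    # coming out of SimpleImputer can be binned column by column.
    X = pd.DataFrame(X)
    # Extra keyword arguments to DataFrame.apply are forwarded to pd.qcut.
    return X.apply(pd.qcut, axis=0, q=q, labels=False)

# Made-up sample with one missing fare.
df = pd.DataFrame({"Fare": [5.0, 7.5, np.nan, 10.0, 13.0, 20.0, 26.0, 31.0, 52.0, 80.0]})

fare_transformer = Pipeline([
    ("fare_imputer", SimpleImputer(strategy="median")),
    ("fare_bands", FunctionTransformer(func=array_qcut, kw_args={"q": 5})),
])

fare_bands = fare_transformer.fit_transform(df[["Fare"]])
print(np.asarray(fare_bands).ravel())
```

Note that this still recomputes the quantiles on every call to transform, so the caveat about test data applies here as well.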