I am trying to write a custom transformer to use in a pipeline that pre-processes data.
Here is the code I'm using (sourced - not written by me). It takes in a dataframe, scales the features, and returns a dataframe:
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import StandardScaler

    class DFStandardScaler(BaseEstimator, TransformerMixin):
        def __init__(self):
            self.ss = None

        def fit(self, X, y=None):
            self.ss = StandardScaler().fit(X)
            return self

        def transform(self, X):
            Xss = self.ss.transform(X)
            Xscaled = pd.DataFrame(Xss, index=X.index, columns=X.columns)
            return Xscaled
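For reference, here is a minimal sketch of how this transformer behaves when it only receives numeric columns. The small `numeric_df` frame is an illustrative assumption built from a few values of the sample data, not the full dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class DFStandardScaler(BaseEstimator, TransformerMixin):
    """StandardScaler wrapper that returns a DataFrame instead of an array."""

    def __init__(self):
        self.ss = None

    def fit(self, X, y=None):
        self.ss = StandardScaler().fit(X)
        return self

    def transform(self, X):
        Xss = self.ss.transform(X)
        # Re-attach the original index and column labels.
        return pd.DataFrame(Xss, index=X.index, columns=X.columns)


# Hypothetical numeric-only frame (values taken from the sample rows).
numeric_df = pd.DataFrame({
    "length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255, 0.300],
    "height": [0.095, 0.090, 0.135, 0.125, 0.080, 0.095],
})

scaled = DFStandardScaler().fit_transform(numeric_df)
print(scaled.columns.tolist())
print(scaled.mean().round(6).tolist())
```

With only numeric columns the labels survive the round trip and each column ends up centred on zero, so the open question is only how to keep 'sex' away from the scaler.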
My data has both categorical and continuous features. The scaler cannot handle the categorical feature ('sex'), so when I fit this pipeline on the dataframe below it throws an error because it tries to scale the string labels in 'sex':
      sex  length  diameter  height  whole_weight  shucked_weight  \
    0   M   0.455     0.365   0.095        0.5140          0.2245
    1   M   0.350     0.265   0.090        0.2255          0.0995
    2   F   0.530     0.420   0.135        0.6770          0.2565
    3   M   0.440     0.365   0.125        0.5160          0.2155
    4   I   0.330     0.255   0.080        0.2050          0.0895
    5   I   0.425     0.300   0.095        0.3515          0.1410
How do I pass a list of categorical / continuous features into the transformer so it will scale the proper features? Or is it better to somehow code the feature type check inside the transformer?
Basically, you need another step in the Pipeline: a similar class, also inheriting from BaseEstimator and TransformerMixin, that selects the columns to pass on to the next step:
    class ColumnSelector(BaseEstimator, TransformerMixin):
        def __init__(self, columns: list):
            # Store under the same name as the argument so get_params() and clone() work.
            self.columns = columns

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            return X.loc[:, self.columns]
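As a quick check, the selector can be tried on its own before it goes into a pipeline. This is a minimal sketch: the three-row `df` is a made-up subset of the sample data, and the column list is stored as `self.columns` (matching the constructor argument) on the assumption that you want `get_params()` and `clone()` to work.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pass through only the requested DataFrame columns."""

    def __init__(self, columns):
        # Attribute name matches the argument name so BaseEstimator can introspect it.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless.
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns]


# Hypothetical subset of the sample data.
df = pd.DataFrame({
    "sex": ["M", "M", "F"],
    "length": [0.455, 0.350, 0.530],
    "diameter": [0.365, 0.265, 0.420],
})

numeric_only = ColumnSelector(["length", "diameter"]).fit_transform(df)
print(numeric_only.columns.tolist())
```

The 'sex' column never reaches whatever step comes next, which is what lets the scaler run without errors.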
Then, in your main script, the pipeline looks like this:
    from sklearn import pipeline

    selector = ColumnSelector(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight'])
    pipe = pipeline.make_pipeline(
        selector,
        DFStandardScaler()
    )
    pipe2 = pipeline.make_pipeline(
        # some steps for the sex column
    )
    full_pipeline = pipeline.make_pipeline(
        pipeline.make_union(
            pipe,
            pipe2
        ),
        # some other step
    )
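To make the outline above concrete, here is one possible end-to-end sketch. Several parts are assumptions rather than part of the original answer: the 'sex' branch is filled in with a `OneHotEncoder`, the plain `StandardScaler` stands in for `DFStandardScaler` to keep the example short, and the data is a two-feature subset of the sample rows.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pass through only the requested DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns]


# Hypothetical subset of the sample data.
df = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I", "I"],
    "length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255, 0.300],
})

# Branch 1: continuous features are selected, then scaled.
numeric_pipe = make_pipeline(
    ColumnSelector(["length", "diameter"]),
    StandardScaler(),
)

# Branch 2 (assumed): the categorical feature is selected, then one-hot encoded.
sex_pipe = make_pipeline(
    ColumnSelector(["sex"]),
    OneHotEncoder(handle_unknown="ignore"),
)

# The union runs both branches on the same input and stacks their outputs side by side.
features = make_union(numeric_pipe, sex_pipe)
X = features.fit_transform(df)

# OneHotEncoder is sparse by default, so the stacked result may be sparse too.
if hasattr(X, "toarray"):
    X = X.toarray()

print(X.shape)
```

The result has the two scaled columns followed by one indicator column per 'sex' category. If you would rather not maintain a selector class, scikit-learn also provides `sklearn.compose.ColumnTransformer`, which routes named columns to different transformers.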