python machine-learning scikit-learn pipeline

Custom transformer for Scikit Learn Pipeline


I'm using the scikit-learn Pipeline object because I have a sequence of tasks to perform (upsampling, feature selection, classification). My upsampling method is a custom one, which means I have to implement a custom transformer for the pipeline.

A transformer must have a fit and a transform method. Of course, I only want to upsample the training data, not the test data. Does this mean that I only have to implement the fit method and not the transform method (upsampling the dataset passed to fit)? As I understand it, the transform method is applied to both the training and the test set...


Solution

  • scikit-learn transformers can't change the number of samples; the API does not support this - see http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin.fit_transform - note the dimensions of X, y and X_new. Also note that transformers return only X, not y, so if you change the number of rows in X it will no longer match y.

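    One way to do it is to run the upsampling outside the pipeline: generate new samples for the training data and pass them to the pipeline, and don't generate new samples for the test data. But this won't work with, for example, cross-validation, because the upsampling would have to be redone on the training part of every fold.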

    To make it work for cross-validation and model selection, you'll need a custom Pipeline class that supports transformers which change n_samples. For example, an implementation can be found in the imbalanced-learn package: see here. Check this package - if you need upsampling, your upsampling method may already be implemented in imbalanced-learn.