
Including unlabelled data in sklearn pipeline


I'm setting up a machine learning pipeline to classify some data. I have a lot of unlabelled data (i.e. the target variable is unknown) that I would like to make use of. One way I would like to do this is to use the unlabelled data when fitting the transformers in my pipeline. For example, when StandardScaler is fitted, I want it to fit on the given training data plus the unlabelled data, and then transform only the training data.

For clarity, outside of a pipeline I can implement it like this:

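    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Fit on labelled + unlabelled rows, then transform only the labelled rows.
    all_data = pd.concat([labelled_data, unlabelled_data])

    s_scaler = StandardScaler()
    s_scaler.fit(all_data)
    scaled_labelled_df = s_scaler.transform(labelled_data)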

Is there a way of implementing this in an sklearn pipeline? I've had a look at FunctionTransformer but don't understand how I could use it in this case.


Solution

  • Defining a new class that inherits from the desired transformer and overrides its fit method should do the trick, e.g.

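    class StandardScaleWULD(StandardScaler):
        """StandardScaler whose fit also uses the unlabelled rows."""

        def __init__(self):
            super().__init__()
            # UNLABELLED_TRAITS is assumed to be a DataFrame defined elsewhere,
            # with the same columns as the training data.
            self.unlabelled_data = UNLABELLED_TRAITS

        def fit(self, X, y=None, sample_weight=None):
            # Statistics come from the training rows plus the unlabelled rows;
            # transform() is inherited unchanged.
            all_data = pd.concat([X, self.unlabelled_data])
            # Leave sample_weight as None: it would not match the longer all_data.
            return super().fit(all_data, y, sample_weight)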

    This new transformer can then be used in the pipeline as usual.
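As a rough sketch of how this could look end to end, the variant below passes the unlabelled rows in as a constructor argument instead of reading a global, so that `get_params` and `clone` can see them. The class name `StandardScalerWithUnlabelled`, the `unlabelled_data` parameter, the toy data and the choice of `LogisticRegression` are all assumptions made for illustration, not part of the original answer. It also assumes both inputs are pandas DataFrames with the same columns, and padding `sample_weight` with ones for the unlabelled rows is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class StandardScalerWithUnlabelled(StandardScaler):
    """Hypothetical StandardScaler that also fits on a fixed set of unlabelled rows."""

    def __init__(self, unlabelled_data=None, *, copy=True, with_mean=True, with_std=True):
        # Every constructor argument is stored unchanged so clone() can rebuild the object.
        super().__init__(copy=copy, with_mean=with_mean, with_std=with_std)
        self.unlabelled_data = unlabelled_data

    def fit(self, X, y=None, sample_weight=None):
        if self.unlabelled_data is None:
            return super().fit(X, y, sample_weight)
        # Assumes X and unlabelled_data are DataFrames with identical columns.
        all_data = pd.concat([X, self.unlabelled_data], ignore_index=True)
        if sample_weight is not None:
            # Assumption: unlabelled rows get weight 1.
            extra = np.ones(len(self.unlabelled_data))
            sample_weight = np.concatenate([np.asarray(sample_weight, dtype=float), extra])
        return super().fit(all_data, y, sample_weight)


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
labelled = pd.DataFrame(rng.normal(0, 1, size=(20, 2)), columns=["a", "b"])
unlabelled = pd.DataFrame(rng.normal(5, 3, size=(200, 2)), columns=["a", "b"])
y = np.array([0, 1] * 10)

pipe = Pipeline([
    ("scale", StandardScalerWithUnlabelled(unlabelled_data=unlabelled)),
    ("clf", LogisticRegression()),
])
pipe.fit(labelled, y)        # scaler statistics come from all 220 rows
preds = pipe.predict(labelled)

fresh = clone(pipe)          # unfitted copy that still carries the unlabelled rows
```

Because the unlabelled rows are an ordinary constructor parameter, utilities that clone the pipeline (for example cross-validation) should keep them attached to each copy. Only `fit` is changed, so `transform` still scales just the rows it is given.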