I'm setting up a machine-learning pipeline to classify some data. I have lots of unlabelled data (i.e. the target variable is unknown) that I would like to make use of. One way I would like to use it is to fit the transformers in my pipeline on it. For example, for the variables I am scaling, when StandardScaler is called I want it to fit on the given training data plus the unlabelled data, and then transform only the training data.
For clarity, outside of a pipeline I can implement it like this:
all_data = pd.concat([labelled_data, unlabelled_data])
s_scaler = StandardScaler()
s_scaler.fit(all_data)
scaled_labelled_df = s_scaler.transform(labelled_data)
Is there a way of implementing this in an sklearn pipeline? I've had a look at FunctionTransformer but don't understand how I could use it in this case.
Defining a new class that inherits from the desired transformer and overrides its fit method should do the trick, e.g.
class StandardScaleWULD(StandardScaler):
    def __init__(self):
        super().__init__()
        # Module-level DataFrame holding the unlabelled rows.
        self.unlabelled_data = UNLABELLED_TRAITS

    def fit(self, X, y=None, sample_weight=None):
        # Fit on the labelled rows plus the unlabelled rows;
        # transform() is inherited unchanged, so it is still
        # applied to X alone.
        all_data = pd.concat([X, self.unlabelled_data])
        return super().fit(all_data, y, sample_weight)
This new transformer can then be used in the pipeline as usual.
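For instance, here is a minimal, self-contained sketch with made-up data (the `UNLABELLED_TRAITS` DataFrame and the single column `"a"` are placeholders, not from the original question), showing that the scaler's statistics end up being computed over the combined data:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real unlabelled dataset.
UNLABELLED_TRAITS = pd.DataFrame({"a": [4.0, 5.0, 6.0]})

class StandardScaleWULD(StandardScaler):
    def __init__(self):
        super().__init__()
        self.unlabelled_data = UNLABELLED_TRAITS

    def fit(self, X, y=None, sample_weight=None):
        # Fit on labelled + unlabelled rows.
        all_data = pd.concat([X, self.unlabelled_data])
        return super().fit(all_data, y, sample_weight)

labelled = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
y = pd.Series([0, 1, 0])

pipe = Pipeline([
    ("scale", StandardScaleWULD()),
    ("clf", LogisticRegression()),
])
pipe.fit(labelled, y)

# The scaler's mean was computed on all six rows (1..6 -> 3.5),
# not on the three labelled rows alone (1..3 -> 2.0).
print(pipe.named_steps["scale"].mean_)
```

Note that only `fit` is overridden: the pipeline's later `transform` calls still operate on the labelled data passed through it.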