python machine-learning scikit-learn data-science pipeline

Get instance variable of custom transformer in sklearn pipeline

I am working on a supervised learning problem and want to build a full Pipeline from beginning to end, starting with the train-test split. I wrote a custom class that wraps sklearn's train_test_split so it can be used as a step in an sklearn Pipeline. Its fit_transform returns the training set. I still want to access the test set later, so I store it as an instance variable in the custom transformer class like this:

self.test_set = test_set

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

class train_test_splitter([...])
[... 
...]
    def transform(self, X):
        train_set, test_set = train_test_split(X, test_size=0.2)
        self.test_set = test_set
        return train_set

split_pipeline = Pipeline([
    ('splitter', train_test_splitter()),
])
df_train = split_pipeline.fit_transform(df)

Now I want to get the test set like this:

df_test = splitter.test_set

It's not working. How do I get the variables of the instance "splitter"? Where is it stored?

Solution

You can access the steps of a pipeline in a number of ways. For example,

split_pipeline['splitter'].test_set
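
To make "a number of ways" concrete, here is a minimal sketch of the lookups I know Pipeline supports. The TrainTestSplitter class is a hypothetical stand-in for the question's transformer: its base classes, the fit method, the toy array and random_state=0 are my assumptions, since the original class body is elided.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


class TrainTestSplitter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the question's train_test_splitter."""

    def fit(self, X, y=None):
        # Nothing to learn; the split happens in transform.
        return self

    def transform(self, X):
        # random_state is an assumption, added only to make the sketch reproducible.
        train_set, test_set = train_test_split(X, test_size=0.2, random_state=0)
        self.test_set = test_set
        return train_set


data = np.arange(20).reshape(10, 2)  # toy data: 10 rows, 2 columns
split_pipeline = Pipeline([("splitter", TrainTestSplitter())])
data_train = split_pipeline.fit_transform(data)

# Four equivalent ways to reach the fitted step object.
by_name = split_pipeline["splitter"]
by_named_steps = split_pipeline.named_steps["splitter"]
by_position = split_pipeline[0]
by_steps_list = split_pipeline.steps[0][1]

# The instance variable lives on that step object.
data_test = by_name.test_set
```

All four lookups return the same object, so the name "splitter" only exists as a key inside the pipeline, not as a variable in your namespace.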

That said, I don't think this is a good approach. Once you fill out the pipeline with more steps, everything will work as you want at fit time, but when predicting or transforming other data the pipeline will still call your transform method. That generates a new train-test split, overwrites the stored test set, and sends only the new train set down the pipe to the remaining steps.
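
Here is a small sketch of that failure mode. Again, TrainTestSplitter is a hypothetical stand-in for the elided class, and the StandardScaler step, the toy arrays and random_state=0 are assumptions chosen just for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class TrainTestSplitter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the question's train_test_splitter."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Splits on every call, including calls made on new data.
        train_set, test_set = train_test_split(X, test_size=0.2, random_state=0)
        self.test_set = test_set
        return train_set


pipeline = Pipeline([
    ("splitter", TrainTestSplitter()),
    ("scaler", StandardScaler()),  # assumed downstream step
])

X_fit = np.arange(20, dtype=float).reshape(10, 2)        # values below 20
X_new = np.arange(100, 140, dtype=float).reshape(20, 2)  # values from 100 up

# At fit time: 8 rows go down the pipe, 2 rows are stored as the test set.
fit_output = pipeline.fit_transform(X_fit)
test_set_after_fit = pipeline["splitter"].test_set.copy()

# Transforming new data splits again and overwrites the stored test set.
new_output = pipeline.transform(X_new)
test_set_after_transform = pipeline["splitter"].test_set

# Rows of X_new that never reached the scaler.
rows_dropped = X_new.shape[0] - new_output.shape[0]
```

After the second call, the stored test set holds rows from X_new rather than from the original data, and part of X_new was silently held back. A common way around this is to call train_test_split once before building the pipeline, fit the pipeline on the training part only, and keep the test part in an ordinary variable.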