python · scikit-learn · pipeline

sklearn pipeline methods besides fit and predict


I'm working on a financial project that requires me to write my own transformer and estimator, so I have a scores_ method on my transformer and a residuals method on my estimator.

E.g.:

from sklearn import base

class my_transformer(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, some_arg):
        self.some_arg = some_arg

    def fit(self, X, y=None):
        self.y_ = y  # keep y around so scores_ can use it later
        return self

    def scores_(self, X):
        # somefunc is a placeholder for the actual scoring logic
        return somefunc(X, self.y_, self.some_arg)

    def transform(self, X):
        scores = self.scores_(X)
        # factorSelect is a placeholder for the factor-selection logic
        return factorSelect(scores, X)
    

and

class my_estimator(base.BaseEstimator, base.RegressorMixin):

    def __init__(self, some_arg):
        self.some_arg = some_arg

    def fit(self, X, y):
        # some_other_func is a placeholder for the actual fitting logic
        some_other_func(X, y, self.some_arg)
        return self

    def predict(self, X):
        # some_other_other_func is a placeholder for the actual prediction logic
        result = some_other_other_func(X)
        return result

    def residuals(self, X, y):
        # predictions come back as a pandas object, so .sub aligns on the index
        result = self.predict(X)
        return result.sub(y, axis='index')

So now I set up my pipeline like this:

pipe = Pipeline([
    ('transformer', my_transformer(some_arg)),
    ('estimator', my_estimator(some_arg)),
])

This works well when doing fit and predict:

pipe.fit(X,y)
pipe.predict(X)

However, I also need the scores and the residuals for later steps in my workflow, and at the moment I can only get at them like this:

pipe['transformer'].scores_(X)
pipe['estimator'].residuals(Not_X, y)

Here, the .scores_ call works fine, but for .residuals I have to pass in Not_X = pipe['transformer'].fit_transform(X) instead of X.
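
Spelled out, my current workaround looks like this:

# Current workaround: pre-transform X by hand before asking for residuals
Not_X = pipe['transformer'].fit_transform(X)
residuals = pipe['estimator'].residuals(Not_X, y)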

This is troublesome and defeats the purpose of using the pipeline. So, how should I do it with Pipeline? And if Pipeline won't do this, are there any other suggestions?

Thank you!


Solution

  • How should I do it with Pipeline?

    The same way you handle any programming task interfacing with other people's code -- add layers till it does what you want ;)

    In particular, the functionality you want isn't that complicated. The hardest part of the problem is figuring out what you want the syntax to look like (it's not obvious, for example, that you would always want to transform just the first argument of whatever function you're accessing), and given sklearn's focus on clean docs and simple APIs it's not surprising that they haven't tackled your exact use case yet.

    Something like the following should serve as inspiration (the only really finicky bit is using __getattr__ to override attribute access -- that's the reason we can type some_partial_pipeline.residuals even though PartialPipeline doesn't have a residuals attribute. See the getattr docs):

    from sklearn.pipeline import Pipeline as _Pipeline
    
    class PartialPipeline:
        """Represents a sequence of steps without any of the bells and whistles of a sklearn.pipeline.Pipeline"""
        def __init__(self, steps):
            self._steps = steps
        
        def __getattr__(self, attr_name):
            # Look the attribute up on the last step's estimator
            obj = getattr(self._steps[-1][-1], attr_name)
            if not callable(obj):
                # Plain attributes (e.g. fitted arrays) are returned as-is
                return obj
            def _f(X, *args, **kwargs):
                # Push X through every earlier step before calling the method
                for _, v in self._steps[:-1]:
                    X = v.transform(X)
                return obj(X, *args, **kwargs)
            return _f
    
    class Pipeline(_Pipeline):
        """Wrapper around sklearn.pipeline.Pipeline allowing easy access to the attributes of its steps"""
        def __init__(self, steps, *, memory=None, verbose=False):
            self.__steps = dict(steps)  # Python >=3.6 for dict ordering
            self.__memory = memory
            self.__verbose = verbose
            super().__init__(steps, memory=memory, verbose=verbose)
        
        def at(self, step_name):
            # Steps up to and including step_name; the earlier ones will pre-transform X
            i = list(self.__steps).index(step_name)
            return PartialPipeline(list(self.__steps.items())[:i+1])
    
    #
    # Example use
    #
    
    pipe=Pipeline([
        ('transformer', my_transformer(some_arg))
      , ('estimator', my_estimator(some_arg))
    ])
    
    scores = pipe.at('transformer').scores_(X)
    residuals = pipe.at('estimator').residuals(X, y)
    
    # The thing you were already trying
    custom_residuals = pipe['estimator'].residuals(X_pretransformed, y)
    
    # E.g., if the estimator were an MLPRegressor
    loss_values = pipe.at('estimator').loss_curve_
    loss_values = pipe['estimator'].loss_curve_
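
    As a quick sanity check, the same at() accessor also works with stock sklearn estimators. StandardScaler and LinearRegression below are just stand-ins (they are not part of the question), picked to show that .at() pre-transforms X the same way the pipeline itself does:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression

    X_demo = np.random.rand(20, 3)
    y_demo = np.random.rand(20)

    check = Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression()),
    ])
    check.fit(X_demo, y_demo)

    # predict via .at('model') runs X_demo through the scaler first,
    # so it matches the pipeline's own predict
    assert np.allclose(check.at('model').predict(X_demo), check.predict(X_demo))

    # Non-callable attributes of a step (like coef_) are returned directly
    print(check.at('model').coef_)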