I'm working on a financial project that requires writing my own transformer and estimator, so I have a scores_ method on my transformer and a residuals method on my estimator.
E.g.:
class my_transformer(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, some_arg):
        self.some_arg = some_arg

    def fit(self, X, y=None):
        return self

    def scores_(self, X):
        return somefunc(X, y, self.some_arg)

    def transform(self, X):
        scores = self.scores_(X)
        return factorSelect(scores, X)
and
class my_estimator(base.BaseEstimator, base.RegressorMixin):
    def __init__(self, some_arg):
        self.some_arg = some_arg

    def fit(self, X, y):
        some_other_func(X, y, self.some_arg)
        return self

    def predict(self, X):
        result = some_other_other_func(X)
        return result

    def residuals(self, X, y):
        result = self.predict(X)
        return result.sub(y, axis='index')
So now I set up my pipeline like this:
pipe = Pipeline([
    ('transformer', my_transformer(some_arg)),
    ('estimator', my_estimator(some_arg))
])
This works well when doing fit and predict:
pipe.fit(X,y)
pipe.predict(X)
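For reference, this works because Pipeline.fit and Pipeline.predict automatically push X through every earlier step before reaching the final estimator. A minimal sketch of that equivalence, using stock sklearn components (StandardScaler and LinearRegression stand in for the custom classes above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

pipe = Pipeline([('scaler', StandardScaler()),
                 ('reg', LinearRegression())])
pipe.fit(X, y)

# pipe.predict(X) is shorthand for: transform X through every earlier
# step, then call the final estimator's predict on the result.
manual = pipe['reg'].predict(pipe['scaler'].transform(X))
assert np.allclose(pipe.predict(X), manual)
```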
However, I also need the scores and residuals for subsequent steps. The only way I can get at them is:
pipe['transformer'].scores_(X)
pipe['estimator'].residuals(Not_X, y)
Here, .scores_ works fine, but for .residuals I have to pass in Not_X = pipe['transformer'].fit_transform(X) instead of X. This is cumbersome and defeats the purpose of using the pipeline. So, how should I do this with Pipeline? And if Pipeline won't do it, are there any other suggestions?
Thank you!
How should I do it with Pipeline?
The same way you handle any programming task interfacing with other people's code -- add layers till it does what you want ;)
In particular, the functionality you want isn't that complicated. The hardest part of the problem is figuring out what you want the syntax to look like (it's not obvious, for example, that you would always want to transform just the first argument of any function you're accessing), and given sklearn's focus on clean docs and simple APIs, it's not surprising that they haven't tackled your exact use case yet.
Something like the following should serve as inspiration (the only really finicky bit is using __getattr__ to override attribute access -- that's the reason we can type some_partial_pipeline.residuals even though PartialPipeline doesn't have a residuals attribute; see the getattr docs):
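As a toy illustration of that fallback (this Proxy class is just for demonstration, not part of the solution): __getattr__ is only consulted when normal attribute lookup fails, which is what makes transparent delegation possible.

```python
class Proxy:
    """Delegates unknown attribute lookups to a wrapped object."""
    def __init__(self, target):
        self._target = target  # found by normal lookup; __getattr__ never fires for it

    def __getattr__(self, name):
        # Called only when normal lookup fails on the Proxy itself.
        return getattr(self._target, name)

p = Proxy("hello")
print(p.upper())   # not defined on Proxy, delegated to str.upper -> 'HELLO'
print(p._target)   # found directly on the instance -> 'hello'
```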
from sklearn.pipeline import Pipeline as _Pipeline

class PartialPipeline:
    """Represents a sequence of steps without any of the bells and whistles of a sklearn.pipeline.Pipeline"""
    def __init__(self, steps):
        self._steps = steps

    def __getattr__(self, attr_name):
        obj = getattr(self._steps[-1][-1], attr_name)
        if not callable(obj):
            return obj
        def _f(X, *args, **kwargs):
            # Replay every transform before the final step, then forward the call
            for _, v in self._steps[:-1]:
                X = v.transform(X)
            return obj(X, *args, **kwargs)
        return _f
class Pipeline(_Pipeline):
    """Wrapper around sklearn.pipeline.Pipeline allowing easy access to the attributes of its steps"""
    def __init__(self, steps, *, memory=None, verbose=False):
        self.__steps = dict(steps)  # Python >=3.6 for dict ordering
        self.__memory = memory
        self.__verbose = verbose
        super().__init__(steps, memory=memory, verbose=verbose)

    def at(self, step_name):
        i = list(self.__steps).index(step_name)
        return PartialPipeline(list(self.__steps.items())[:i+1])
#
# Example use
#
pipe = Pipeline([
    ('transformer', my_transformer(some_arg)),
    ('estimator', my_estimator(some_arg))
])
pipe.fit(X, y)

scores = pipe.at('transformer').scores_(X)
residuals = pipe.at('estimator').residuals(X, y)

# The thing you were already trying
custom_residuals = pipe['estimator'].residuals(X_pretransformed, y)

# E.g., if the estimator were an MLPRegressor
loss_values = pipe.at('estimator').loss_curve_
loss_values = pipe['estimator'].loss_curve_
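Here is a self-contained check of the approach, swapping in stock sklearn components for your custom classes (StandardScaler for the transformer, and a thin LinearRegression subclass with a residuals method for the estimator -- both are stand-ins, not your actual code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline as _Pipeline
from sklearn.preprocessing import StandardScaler

class PartialPipeline:
    """Replays the transforms before a step, then forwards the call."""
    def __init__(self, steps):
        self._steps = steps

    def __getattr__(self, attr_name):
        obj = getattr(self._steps[-1][-1], attr_name)
        if not callable(obj):
            return obj
        def _f(X, *args, **kwargs):
            for _, v in self._steps[:-1]:
                X = v.transform(X)
            return obj(X, *args, **kwargs)
        return _f

class Pipeline(_Pipeline):
    def __init__(self, steps, *, memory=None, verbose=False):
        self.__steps = dict(steps)
        super().__init__(steps, memory=memory, verbose=verbose)

    def at(self, step_name):
        i = list(self.__steps).index(step_name)
        return PartialPipeline(list(self.__steps.items())[:i + 1])

class LinearRegressionWithResiduals(LinearRegression):
    """Stand-in for my_estimator: a regressor with a residuals method."""
    def residuals(self, X, y):
        return self.predict(X) - y

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5  # exactly linear, so residuals ~ 0

pipe = Pipeline([('scaler', StandardScaler()),
                 ('estimator', LinearRegressionWithResiduals())])
pipe.fit(X, y)

# .at('estimator') replays the scaler's transform before calling
# residuals, so raw X can be passed directly:
r = pipe.at('estimator').residuals(X, y)
assert np.allclose(r, 0, atol=1e-8)

# Non-callable attributes of a step are returned as-is:
coefs = pipe.at('estimator').coef_
```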