The docs of sklearn.compose.TransformedTargetRegressor state that:

regressor : object, default=None
    Regressor object such as derived from RegressorMixin. This regressor will
    automatically be cloned each time prior to fitting. If regressor is None,
    LinearRegression is created and used.
What is the rationale behind cloning the given regressor
each time prior to fitting? Why would this be useful?
This behavior prevents, for example, the following code from working:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
X = np.random.default_rng(seed=1).normal(size=(100,3))
y = np.random.default_rng(seed=1).normal(size=100)
model = RandomForestRegressor()
pipeline = Pipeline(
    steps=[
        ('normalize', StandardScaler()),
        ('model', model),
    ],
)
tt = TransformedTargetRegressor(regressor=pipeline, transformer=StandardScaler())
tt.fit(X, y)
print(model.feature_importances_)
It results in:
Traceback (most recent call last):
File "/tmp/test.py", line 21, in <module>
print(model.feature_importances_)
[...]
sklearn.exceptions.NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
which is not surprising, given that the model object is cloned by the TransformedTargetRegressor.
So, is there a way to prevent this cloning behavior and make the above code work?
All sklearn meta-estimators (except Pipeline) clone their base estimators; I can't confidently answer why the developers chose that paradigm, though the sketch below shows one property that cloning guarantees.
But the fitted base estimators are always made available in new attributes: instead of model.feature_importances_, use tt.regressor_['model'].feature_importances_.
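Applied to the code in the question (same names, after tt.fit(X, y) has run):

# regressor_ holds the fitted clone of the pipeline passed to
# TransformedTargetRegressor; a Pipeline can be indexed by step name
# to retrieve the fitted step.
fitted_model = tt.regressor_['model']
print(fitted_model.feature_importances_)

The original model object stays unfitted, because fit() only ever ran on the clone.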