I tried to subclass XGBRegressor to create a custom scikit-learn compatible estimator with GridSearchCV embedded, but I kept getting a TypeError saying "super() takes no keyword arguments."
In the code below, the first snippet is a procedural version of the second. The second snippet is what I intended but could not get working: a new class for XGBoost regressors with GridSearchCV embedded as a cross-validator.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
# procedural version
X, y = make_regression(n_samples=20, n_features=3, random_state=42)
parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
tunned_regr = GridSearchCV(XGBRegressor(), parameters)
tunned_regr.fit(X, y)
pred_y = tunned_regr.predict(X)
fig, ax = plt.subplots(figsize=(10,6))
plt.scatter(range(len(X)), pred_y, label="predicted")
plt.scatter(range(len(X)), y, label="true")
plt.legend()
# the new xgboost regressor with gridsearchCV embedded
class XGBR(XGBRegressor):
    def __init__(self, objective='reg:linear'):
        super(XGBR, self).__init__(objective=objective)

    def fit(self, X, y):
        parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
        self.regr = GridSearchCV(super(XGBR, self), parameters)
        self.regr.fit(X, y)
        return self

    def predict(self, X):
        return self.regr.predict(X)
Running the commands xgbr = XGBR(); xgbr.fit(X, y), you should see the error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/test.py in <module>
13 return self.regr.predict(X)
14 xgbr = XGBR()
---> 15 xgbr.fit(X, y)
/test.py in fit(self, X, y)
7 parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
8 self.regr = GridSearchCV(super(XGBR, self), parameters)
----> 9 self.regr.fit(X, y)
10 return self
11
~/.local/lib/python3.9/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
803 n_splits = cv_orig.get_n_splits(X, y, groups)
804
--> 805 base_estimator = clone(self.estimator)
806
807 parallel = Parallel(n_jobs=self.n_jobs, pre_dispatch=self.pre_dispatch)
~/.local/lib/python3.9/site-packages/sklearn/base.py in clone(estimator, safe)
80 for name, param in new_object_params.items():
81 new_object_params[name] = clone(param, safe=False)
---> 82 new_object = klass(**new_object_params)
83 params_set = new_object.get_params(deep=False)
84
TypeError: super() takes no keyword arguments
This line looks suspicious to me:
self.regr = GridSearchCV(super(XGBR, self), parameters)
I suspect you want to write the following instead:
self.regr = GridSearchCV(self, parameters)
In the procedural version of your code, you write
tunned_regr = GridSearchCV(XGBRegressor(), parameters)
so you are passing an instance of the XGBRegressor class as the first parameter of the GridSearchCV constructor. In your code, XGBR is a subclass of XGBRegressor, so self will be an instance of XGBR and hence also of XGBRegressor.
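Incidentally, the root cause of the TypeError is that super(XGBR, self) is a proxy object whose own class is super, not XGBRegressor. GridSearchCV's clone() reads estimator.__class__ and tries to rebuild the estimator via klass(**params), which here becomes super(**params) and raises exactly the error in the traceback. A minimal illustration (Base and Child are hypothetical stand-ins; no xgboost needed):

```python
class Base:
    def get_params(self, deep=True):
        # Minimal estimator-like method so the proxy looks plausible.
        return {}


class Child(Base):
    pass


proxy = super(Child, Child())
# The proxy delegates method calls to Base, but its own class is `super`:
print(proxy.__class__)   # <class 'super'>
# sklearn's clone() effectively calls proxy.__class__(**params),
# i.e. super(**params), which is the TypeError in the traceback.
```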
However, after spending more time looking at your code and your question, I'm not sure inheritance is the way to go here.
There is a general maxim in software development of 'prefer composition over inheritance'. There are cases where inheritance is useful, but it tends to be used in places where it isn't the best approach, and I think this is one of those cases.
Is an XGBR also an XGBRegressor? Can you use an instance of your XGBR class anywhere that an XGBRegressor can also be used? If the answer to either of these questions is no, then don't use inheritance.
The following version of your class uses composition instead: it creates an XGBRegressor in the fit() method. You create it and use it in exactly the same way as before:
class XGBR:
    def __init__(self, objective='reg:linear'):
        self.objective = objective

    def fit(self, X, y):
        parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
        self.regr = GridSearchCV(XGBRegressor(objective=self.objective), parameters)
        self.regr.fit(X, y)
        return self

    def predict(self, X):
        return self.regr.predict(X)
For the time being I've chosen to initialise the XGBRegressor in the call to fit(). If an XGBRegressor is slow to create you might wish to create it in __init__ instead. However, if you do this, you would also want to be sure that you can use the same XGBRegressor to analyse multiple datasets, and that the analysis of any dataset isn't influenced by any previous datasets an XGBRegressor has seen. This may or may not be a problem, I don't know.
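To show the same composition pattern end to end, here is a self-contained sketch with sklearn's Ridge standing in for XGBRegressor (so it runs without xgboost installed; TunedRidge and its alphas parameter are hypothetical names, not anything from your code):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV


class TunedRidge:
    """Wraps GridSearchCV around Ridge via composition, like XGBR above."""

    def __init__(self, alphas=(0.1, 1.0, 10.0)):
        self.alphas = alphas

    def fit(self, X, y):
        # Build the inner estimator and the grid search inside fit(),
        # mirroring the composed XGBR class.
        self.regr = GridSearchCV(Ridge(), {'alpha': list(self.alphas)})
        self.regr.fit(X, y)
        return self

    def predict(self, X):
        return self.regr.predict(X)


X, y = make_regression(n_samples=20, n_features=3, random_state=42)
model = TunedRidge().fit(X, y)
pred = model.predict(X)
# The fitted GridSearchCV is available on the wrapper, e.g. best_params_:
print(model.regr.best_params_)
```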
Finally, I add a disclaimer that I am not a data scientist and I also have not tested this code.