I tried to subclass XGBRegressor to create a custom scikit-learn compatible estimator with GridSearchCV embedded, but I kept getting a TypeError saying "super() takes no keyword arguments."
In the code below, the first snippet is a procedural version of the second. The second snippet is what I intended but could not get working: a new class for XGBoost regressors with GridSearchCV embedded as a cross-validator.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
# procedural version
X, y = make_regression(n_samples=20, n_features=3, random_state=42)
parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
tunned_regr = GridSearchCV(XGBRegressor(), parameters)
tunned_regr.fit(X, y)
pred_y = tunned_regr.predict(X)
fig, ax = plt.subplots(figsize=(10,6))
plt.scatter(range(len(X)), pred_y, label="predicted")
plt.scatter(range(len(X)), y, label="true")
plt.legend()
# the new xgboost regressor with gridsearchCV embedded
class XGBR(XGBRegressor):
    def __init__(self, objective='reg:linear'):
        super(XGBR, self).__init__(objective=objective)

    def fit(self, X, y):
        parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
        self.regr = GridSearchCV(super(XGBR, self), parameters)
        self.regr.fit(X, y)
        return self

    def predict(self, X):
        return self.regr.predict(X)
Running the commands xgbr = XGBR(); xgbr.fit(X, y), you should see the error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/test.py in <module>
13 return self.regr.predict(X)
14 xgbr = XGBR()
---> 15 xgbr.fit(X, y)
/test.py in fit(self, X, y)
7 parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
8 self.regr = GridSearchCV(super(XGBR, self), parameters)
----> 9 self.regr.fit(X, y)
10 return self
11
~/.local/lib/python3.9/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
803 n_splits = cv_orig.get_n_splits(X, y, groups)
804
--> 805 base_estimator = clone(self.estimator)
806
807 parallel = Parallel(n_jobs=self.n_jobs, pre_dispatch=self.pre_dispatch)
~/.local/lib/python3.9/site-packages/sklearn/base.py in clone(estimator, safe)
80 for name, param in new_object_params.items():
81 new_object_params[name] = clone(param, safe=False)
---> 82 new_object = klass(**new_object_params)
83 params_set = new_object.get_params(deep=False)
84
TypeError: super() takes no keyword arguments
This line looks suspicious to me:
self.regr = GridSearchCV(super(XGBR, self), parameters)
I suspect you want to write the following instead:
self.regr = GridSearchCV(self, parameters)
In the procedural version of your code, you write
tunned_regr = GridSearchCV(XGBRegressor(), parameters)
so you are passing an instance of the XGBRegressor class as the first parameter of the GridSearchCV constructor. In your code, XGBR is a subclass of XGBRegressor, so self will be an instance of XGBR and hence also of XGBRegressor.
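Incidentally, the root cause of the TypeError is that super(XGBR, self) is a proxy object whose own class is super, not XGBRegressor. GridSearchCV's clone() reads estimator.__class__ and tries to rebuild the estimator via klass(**params), which here becomes super(**params) and raises exactly the error in the traceback. A minimal illustration (Base and Child are hypothetical stand-ins; no xgboost needed):

```python
class Base:
    def get_params(self, deep=True):
        # Minimal estimator-like method so the proxy looks plausible.
        return {}


class Child(Base):
    pass


proxy = super(Child, Child())
# The proxy delegates method calls to Base, but its own class is `super`:
print(proxy.__class__)   # <class 'super'>
# sklearn's clone() effectively calls proxy.__class__(**params),
# i.e. super(**params), which is the TypeError in the traceback.
```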
However, after spending more time looking at your code and your question, I'm not sure inheritance is the way to go here.
There is a general maxim in software development of 'prefer composition over inheritance'. There are cases where inheritance is useful, but it tends to be used in places where it isn't the best approach, and I think this is one of those cases.
Is an XGBR also an XGBRegressor? Can you use an instance of your XGBR class anywhere that an XGBRegressor can also be used? If the answer to either of these questions is no, then don't use inheritance.
The following version of your class uses composition instead: it creates an XGBRegressor in the fit() method. You create it and use it in exactly the same way as before:
class XGBR:
    def __init__(self, objective='reg:linear'):
        self.objective = objective

    def fit(self, X, y):
        parameters = {'n_estimators': [10, 20], 'max_depth': [3, 4]}
        self.regr = GridSearchCV(XGBRegressor(objective=self.objective), parameters)
        self.regr.fit(X, y)
        return self

    def predict(self, X):
        return self.regr.predict(X)
For the time being I've chosen to initialise the XGBRegressor in the call to fit(). If an XGBRegressor is slow to create you might wish to create it in __init__ instead. However, if you do this, you would also want to be sure that you can use the same XGBRegressor to analyse multiple datasets, and that the analysis of any dataset isn't influenced by any previous datasets an XGBRegressor has seen. This may or may not be a problem, I don't know.
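To show the same composition pattern end to end, here is a self-contained sketch with sklearn's Ridge standing in for XGBRegressor (so it runs without xgboost installed; TunedRidge and its alphas parameter are hypothetical names, not anything from your code):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV


class TunedRidge:
    """Wraps GridSearchCV around Ridge via composition, like XGBR above."""

    def __init__(self, alphas=(0.1, 1.0, 10.0)):
        self.alphas = alphas

    def fit(self, X, y):
        # Build the inner estimator and the grid search inside fit(),
        # mirroring the composed XGBR class.
        self.regr = GridSearchCV(Ridge(), {'alpha': list(self.alphas)})
        self.regr.fit(X, y)
        return self

    def predict(self, X):
        return self.regr.predict(X)


X, y = make_regression(n_samples=20, n_features=3, random_state=42)
model = TunedRidge().fit(X, y)
pred = model.predict(X)
# The fitted GridSearchCV is available on the wrapper, e.g. best_params_:
print(model.regr.best_params_)
```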
Finally, I add a disclaimer that I am not a data scientist and I also have not tested this code.