This is a minimal example using XGBClassifier, but I am interested in how this would work in general. I am trying to wrap the model class in order to use it in cross-validation. In this case I am only weighting the imbalanced classes, but my ultimate goal is a somewhat broader change to the pipeline.
My first try was to simply override the fit function:
from sklearn import metrics
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
class WeightedXGBClassifier(XGBClassifier, BaseEstimator, ClassifierMixin):
    @staticmethod
    def get_weights(y):
        sample_weights = compute_sample_weight(class_weight='balanced', y=y)
        return sample_weights

    def fit(self, X, y, **kwargs):
        weights = self.get_weights(y)
        super(XGBClassifier, self).fit(X, y, sample_weight=weights, **kwargs)
which works fine when I fit the model, make predictions, etc. But using it in sklearn's cross_val_score
xgb_model_cv = WeightedXGBClassifier(n_estimators=100, max_depth=4, alpha=100, use_label_encoder=False)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
auc_scorer = metrics.make_scorer(metrics.roc_auc_score, needs_proba=True)
scores = cross_val_score(xgb_model_cv, X, y, scoring=auc_scorer, cv=cv, n_jobs=-1, verbose=1)
throws an error
  File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 106, in __call__
    score = scorer._score(cached_call, estimator, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 306, in _score
    y_pred = self._select_proba_binary(y_pred, clf.classes_)
AttributeError: 'WeightedXGBClassifier' object has no attribute 'classes_'
Now, it is my understanding that the classes_ attribute is created when the model is fitted, but I am not sure how to properly wrap the model so that it is captured. Note that running
model = XGBClassifier(use_label_encoder=False, scale_pos_weight=(~y).sum()/y.sum())
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
works fine. My second try was:
class XGBClassifierWrapper(BaseEstimator, ClassifierMixin):
    def __init__(self, **kwargs):
        # super(BaseEstimator).__init__()
        # super(ClassifierMixin).__init__()
        self.xgb_classifier_obj = XGBClassifier(**kwargs)

    @staticmethod
    def get_weights(y):
        sample_weights = compute_sample_weight(class_weight='balanced', y=y)
        return sample_weights

    def fit(self, X, y, **kwargs):
        weights = self.get_weights(y)
        self.xgb_classifier_obj.fit(X, y, sample_weight=weights, **kwargs)
        return self

    def predict(self, X, **kwargs):
        return self.xgb_classifier_obj.predict(X, **kwargs)

    def predict_proba(self, X, **kwargs):
        return self.xgb_classifier_obj.predict_proba(X, **kwargs)
which again resulted in the same error as above, i.e., a missing classes_ attribute.
(I don't actually get an error when I run any of your code; however, I do get scores consisting only of nan, and after adding error_score='raise' I get your error message.)
In the first approach, I believe the only real problem is the super call in your fit method. super(XGBClassifier, self) looks for a parent class of XGBClassifier, not XGBClassifier itself, as I assume you want. Replace it with just the vanilla super() and everything works.
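To see why the two-argument form skips the classifier's own fit, here is a minimal sketch with stand-in classes. Model, Classifier, WrongWeighted, and RightWeighted are hypothetical names that only mimic the shape of your hierarchy, and the sketch assumes (as the traceback suggests) that classes_ is set by the classifier's fit rather than by its parent's:

```python
class Model:
    """Hypothetical stand-in for the classifier's parent class."""

    def fit(self, X, y):
        self.fitted_ = True
        return self


class Classifier(Model):
    """Hypothetical stand-in for XGBClassifier: its fit records the labels."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        return super().fit(X, y)


class WrongWeighted(Classifier):
    def fit(self, X, y):
        # super(Classifier, self) starts the lookup *after* Classifier in the
        # MRO, so this calls Model.fit and classes_ is never set.
        super(Classifier, self).fit(X, y)
        return self


class RightWeighted(Classifier):
    def fit(self, X, y):
        # Zero-argument super() starts after RightWeighted, so Classifier.fit runs.
        super().fit(X, y)
        return self


wrong = WrongWeighted().fit([[0], [1]], [0, 1])
right = RightWeighted().fit([[0], [1]], [0, 1])
print(hasattr(wrong, "classes_"))  # False
print(hasattr(right, "classes_"))  # True
```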
You should also add return self to the end of fit in your first attempt, but it's not important here. You can probably safely drop BaseEstimator and ClassifierMixin from the inheritance, since XGBClassifier already inherits from them.
Your second, wrapper, approach fails simply because the wrapped xgb_classifier_obj has all the fitted attributes, including classes_, but your wrapper doesn't expose them directly. You can just set self.classes_ = self.xgb_classifier_obj.classes_ in fit, or perhaps define a @property delegation.
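A rough sketch of the @property option, using a stub in place of the real model so that it runs on its own. FittedStub and Wrapper are hypothetical names; in your code the property would go on XGBClassifierWrapper and delegate to xgb_classifier_obj:

```python
class FittedStub:
    """Hypothetical stand-in for the wrapped XGBClassifier."""

    def fit(self, X, y, sample_weight=None):
        self.classes_ = sorted(set(y))
        return self


class Wrapper:
    def __init__(self):
        self.inner = FittedStub()

    def fit(self, X, y):
        self.inner.fit(X, y)
        return self

    @property
    def classes_(self):
        # Delegate to the wrapped estimator. Before fit, the inner object has
        # no classes_, so this raises AttributeError just like an unfitted model.
        return self.inner.classes_


unfitted = Wrapper()
fitted = Wrapper().fit([[0], [1], [2]], [0, 1, 1])
print(hasattr(unfitted, "classes_"))  # False
print(fitted.classes_)                # [0, 1]
```

Compared with assigning self.classes_ in fit, the property cannot go stale if the inner model is refitted, at the cost of a little indirection.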
You should also consider that your __init__ this time doesn't meet the sklearn API, so cloning won't work correctly. I'd advise using the first approach for this reason (fixing the wrapper requires rather more tedious work, in my opinion).
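One way to see the cloning problem: as far as I know, BaseEstimator discovers hyperparameters by inspecting the named arguments of __init__, and a bare **kwargs gives it nothing to find. The sketch below only imitates that lookup with inspect; visible_params, KwargsWrapper, and ExplicitWrapper are hypothetical names, not sklearn functions or classes:

```python
import inspect


def visible_params(cls):
    """Names of explicit __init__ arguments, ignoring self and **kwargs."""
    sig = inspect.signature(cls.__init__)
    return sorted(
        p.name
        for p in sig.parameters.values()
        if p.name != "self" and p.kind is not p.VAR_KEYWORD
    )


class KwargsWrapper:
    def __init__(self, **kwargs):
        self.kwargs = kwargs


class ExplicitWrapper:
    def __init__(self, n_estimators=100, max_depth=4):
        # Store each argument unchanged under its own name.
        self.n_estimators = n_estimators
        self.max_depth = max_depth


print(visible_params(KwargsWrapper))    # []
print(visible_params(ExplicitWrapper))  # ['max_depth', 'n_estimators']
```

With an empty parameter list, a clone would presumably be rebuilt with no arguments at all, so settings such as max_depth=4 could be silently dropped in each cross-validation fold. Listing every parameter explicitly is the tedious part.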