Tags: python, scikit-learn, wrapper, xgboost, cross-validation

Model wrapper for sklearn cross_val_score


This is a minimal example using XGBClassifier, but I am interested in how this would work in general. I am trying to wrap the model class in order to use it in cross-validation. In this case I am only weighting the imbalanced classes, but my ultimate goal is a somewhat broader change to the pipeline.

My first try was to simply override the fit function:

from sklearn import metrics
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.base import BaseEstimator, ClassifierMixin

class WeightedXGBClassifier(XGBClassifier, BaseEstimator, ClassifierMixin):
    
    @staticmethod
    def get_weights(y):
        sample_weights = compute_sample_weight(class_weight='balanced', y=y)
        return sample_weights
    
    def fit(self, X, y, **kwargs):
        weights = self.get_weights(y)
        super(XGBClassifier, self).fit(X, y, sample_weight=weights, **kwargs)

which works fine when I fit the model, make predictions, etc. But using it in sklearn's cross_val_score

xgb_model_cv = WeightedXGBClassifier(n_estimators=100, max_depth=4, alpha=100, use_label_encoder=False)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
auc_scorer = metrics.make_scorer(metrics.roc_auc_score, needs_proba=True)
scores = cross_val_score(xgb_model_cv, X, y, scoring=auc_scorer, cv=cv, n_jobs=-1, verbose=1)

throws an error

File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 106, in __call__
    score = scorer._score(cached_call, estimator, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 306, in _score
    y_pred = self._select_proba_binary(y_pred, clf.classes_)
AttributeError: 'WeightedXGBClassifier' object has no attribute 'classes_'

My understanding is that the classes_ attribute is created when the model is fitted, but I am not sure how to wrap the model properly so that it is captured. Note that running

model = XGBClassifier(use_label_encoder=False, scale_pos_weight=(~y).sum()/y.sum())
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)

works fine. My second try was:

class XGBClassifierWrapper(BaseEstimator, ClassifierMixin):
    def __init__(self, **kwargs):
#         super(BaseEstimator).__init__()
#         super(ClassifierMixin).__init__()
        self.xgb_classifier_obj = XGBClassifier(**kwargs)
    
    @staticmethod
    def get_weights(y):
        sample_weights = compute_sample_weight(class_weight='balanced', y=y)
        return sample_weights
    
    def fit(self, X, y, **kwargs):
        weights = self.get_weights(y)
        self.xgb_classifier_obj.fit(X, y, sample_weight=weights, **kwargs)
        return self
    
    def predict(self, X, **kwargs):
        return self.xgb_classifier_obj.predict(X, **kwargs)
    
    def predict_proba(self, X, **kwargs):
        return self.xgb_classifier_obj.predict_proba(X, **kwargs)

which again resulted in the same error as in the case above, i.e., missing classes_ attribute.


Solution

  • (I don't actually get an error when I run any of your code; however, I do get scores consisting only of nan, and after adding error_score='raise' I get your error message.)

    In the first approach, I believe the only real problem is the super call in fit. super(XGBClassifier, self) looks up fit on a parent class of XGBClassifier, not on XGBClassifier itself, as I assume you want. Replace it with the plain super() and everything works.
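To make the super() point concrete, here is a toy sketch in plain Python. The classes Model and Classifier are hypothetical stand-ins for XGBClassifier's parent class and XGBClassifier itself; the sketch assumes, as the traceback suggests, that the class information is recorded in the classifier's own fit and not in the parent's.

```python
class Model:
    """Hypothetical stand-in for the generic parent class."""

    def fit(self, X, y, sample_weight=None):
        self.fitted_ = True  # the parent knows nothing about classes
        return self


class Classifier(Model):
    """Hypothetical stand-in for XGBClassifier."""

    def fit(self, X, y, sample_weight=None):
        self.classes_ = sorted(set(y))  # recorded only in this fit
        return super().fit(X, y, sample_weight=sample_weight)


class SkipsClassifierFit(Classifier):
    def fit(self, X, y):
        # Lookup starts *after* Classifier in the MRO, so Model.fit runs
        # and Classifier.fit (which sets classes_) is skipped.
        super(Classifier, self).fit(X, y)
        return self


class UsesClassifierFit(Classifier):
    def fit(self, X, y):
        # Plain super() starts after the current class, i.e. at Classifier.
        return super().fit(X, y)


X = [[0], [1], [2], [3]]
y = [0, 1, 0, 1]
broken = SkipsClassifierFit().fit(X, y)
fixed = UsesClassifierFit().fit(X, y)
print(hasattr(broken, "classes_"))  # False
print(hasattr(fixed, "classes_"))   # True
```

Both objects report themselves as fitted, which is why the first attempt appears to work until something asks for classes_.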

    You should also add return self to the end of fit in your first attempt, but it's not important here. You can probably safely drop BaseEstimator and ClassifierMixin from the inheritance, since XGBClassifier already inherits from them.

    Your second, wrapper, approach just fails because the wrapped xgb_classifier_obj has all the fitted attributes, including classes_, but your wrapper doesn't expose that directly. You can just set self.classes_ = self.xgb_classifier_obj.classes_ in fit, or perhaps define a @property delegation.
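The property delegation could be sketched roughly like this. To keep the example dependent on scikit-learn only, DecisionTreeClassifier is used as a stand-in for XGBClassifier, and the names DelegatingWrapper and inner are hypothetical; the **kwargs constructor is kept from the question as-is.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.tree import DecisionTreeClassifier  # stand-in for XGBClassifier
from sklearn.utils.class_weight import compute_sample_weight


class DelegatingWrapper(ClassifierMixin, BaseEstimator):
    def __init__(self, **kwargs):
        # Same constructor shape as in the question.
        self.inner = DecisionTreeClassifier(**kwargs)

    @property
    def classes_(self):
        # Forward to the wrapped model; before fit this raises
        # AttributeError, so hasattr(wrapper, "classes_") is False.
        return self.inner.classes_

    def fit(self, X, y):
        weights = compute_sample_weight(class_weight="balanced", y=y)
        self.inner.fit(X, y, sample_weight=weights)
        return self

    def predict(self, X):
        return self.inner.predict(X)

    def predict_proba(self, X):
        return self.inner.predict_proba(X)


X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

unfitted = DelegatingWrapper(max_depth=2)
print(hasattr(unfitted, "classes_"))  # False

fitted = DelegatingWrapper(max_depth=2).fit(X, y)
print(fitted.classes_)  # [0 1]
```

Compared with assigning self.classes_ in fit, the property cannot go stale if the wrapped model is refitted directly.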

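    You should also consider that your __init__ this time doesn't meet the sklearn API: it takes **kwargs instead of explicit keyword parameters stored as attributes of the same name, which is what get_params and clone rely on, so cloning won't work correctly. I'd advise using the first approach for this reason (fixing the wrapper requires rather more tedious work, in my opinion).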
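If you do want a wrapper that works in general, one possible shape is a meta-estimator that takes the model as an explicit parameter. This is a sketch under my own assumptions, not something from the question: the name BalancedWeightWrapper, the estimator parameter, the LogisticRegression stand-in for XGBClassifier, and the synthetic data are all placeholders.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # stand-in model
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.utils.class_weight import compute_sample_weight


class BalancedWeightWrapper(ClassifierMixin, BaseEstimator):
    def __init__(self, estimator=None):
        # Store the argument untouched under the same name so that
        # get_params and clone can reconstruct the wrapper.
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        base = LogisticRegression() if self.estimator is None else self.estimator
        weights = compute_sample_weight(class_weight="balanced", y=y)
        # Fit a fresh copy and keep it in a trailing-underscore attribute.
        self.estimator_ = clone(base).fit(X, y, sample_weight=weights, **fit_params)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)


# Placeholder imbalanced data, only to exercise the wrapper.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
model = BalancedWeightWrapper(estimator=LogisticRegression(C=0.5, max_iter=1000))

# Cloning keeps the nested hyperparameters.
copy = clone(model)
print(copy.get_params()["estimator__C"])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv, error_score="raise")
print(scores.shape)
```

Passing estimator=XGBClassifier(...) should work the same way, since its fit also accepts sample_weight. Because the inner model's parameters are exposed as estimator__<name>, the wrapper can also be tuned with GridSearchCV.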