XGBoostError: Check failed: typestr.size() == 3 (2 vs. 3) : `typestr' should be of format <endian><type><size of type in bytes>

I'm having a weird issue with a new installation of xgboost. Under normal circumstances it works fine. However, when I use the model in the following function it gives the error in the title.

The dataset I'm using is borrowed from Kaggle and can be seen here:

The function I use to fit my model is the following:

import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split

def get_val_scores(model, X, y, return_test_score=False, return_importances=False, random_state=42, randomize=True, cv=5, test_size=0.2, val_size=0.2, use_kfold=False, return_folds=False, stratify=True):
    # Hold out a test set first.
    print("Splitting data into training and test sets")
    if randomize:
        if stratify:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=True, random_state=random_state)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=True, random_state=random_state)
    else:
        if stratify:
            # Note: scikit-learn raises a ValueError when stratify is combined with shuffle=False.
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=False)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)
    print(f"Shape of training data, X: {X_train.shape}, y: {y_train.shape}.  Test, X: {X_test.shape}, y: {y_test.shape}")
    if use_kfold:
        # Cross-validate on the training portion.
        val_scores = cross_val_score(model, X=X_train, y=y_train, cv=cv)
    else:
        # Otherwise carve a single validation set out of the training portion.
        print("Further splitting training data into validation sets")
        if randomize:
            if stratify:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train, shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=True)
        else:
            if stratify:
                print("Warning! You opted to both stratify your training data and to not randomize it.  These settings are incompatible with scikit-learn.  Stratifying the data, but shuffle is being set to True")
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train, shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=False)
        print(f"Shape of training data, X: {X_train_.shape}, y: {y_train_.shape}.  Val, X: {X_val.shape}, y: {y_val.shape}")
        print("Getting ready to fit model.")
        model.fit(X_train_, y_train_)
        val_score = model.score(X_val, y_val)
    if return_importances:
        if hasattr(model, 'steps'):
            # Pipeline: read the importances from a pipeline step.
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except Exception:
                # The model has not been fitted yet (e.g. after cross_val_score), so fit it first.
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
        else:
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except Exception:
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
    mod_scores = {}
    if use_kfold:
        mod_scores['validation_score'] = val_scores.mean()
        if return_folds:
            mod_scores['fold_scores'] = val_scores
    else:
        mod_scores['validation_score'] = val_score
    if return_test_score:
        mod_scores['test_score'] = model.score(X_test, y_test)
    if return_importances:
        return mod_scores, feats
    else:
        return mod_scores

The weird part that I'm running into is that if I create a pipeline in sklearn, it works on the dataset outside of the function, but not within it. For example:

from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder
from xgboost import XGBClassifier

pipe = make_pipeline(OrdinalEncoder(), XGBClassifier())

X = df.drop('state', axis=1)
y = df['state']

In this case, pipe.fit(X, y) works just fine. But get_val_scores(pipe, X, y) fails with the error message in the title. What's weirder is that get_val_scores(pipe, X, y) seems to work with other datasets, like Titanic. The error occurs as the model is fitting on X_train and y_train.

In this case the loss function is binary:logistic, and the state column has the values successful and failed.
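
For reference, the check quoted in the error message can be mimicked in a few lines. This is only an illustrative sketch, not xgboost's actual implementation, and check_typestr is a hypothetical name. In NumPy's array interface, "<f8" is the typestr of a little-endian float64 array, while an object-dtype array reports "|O", which is only two characters long. That may be why a column of Python strings trips the check, though the post itself does not confirm this.

```python
def check_typestr(typestr: str) -> None:
    """Mimic the length check from the error message (illustrative only)."""
    # The message expects <endian><type><size in bytes>, i.e. three characters.
    if len(typestr) != 3:
        raise ValueError(
            f"Check failed: typestr.size() == 3 ({len(typestr)} vs. 3) : "
            "`typestr' should be of format <endian><type><size of type in bytes>"
        )


check_typestr("<f8")  # three characters: passes silently

try:
    check_typestr("|O")  # two characters: fails like the error in the title
except ValueError as exc:
    print(exc)
```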


  • This appears to be a bug in the xgboost library that is still being fixed, so the current workaround is to downgrade to an older version. I solved the problem by downgrading to xgboost v0.90.

    Check your installed xgboost version:

    import xgboost
    print(xgboost.__version__)

    If the version is not 0.90, uninstall the current version:

    pip uninstall xgboost

    Then install xgboost version 0.90:

    pip install xgboost==0.90

    Run your code again.
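
The version check above can also be done without importing the package, which may help if you want to verify the pin from a script. This is a minimal sketch using only the standard library (Python 3.8+). The helper names installed_version and needs_reinstall are hypothetical, and the default pin of "0.90" is simply the version suggested in this answer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_reinstall(package="xgboost", pinned="0.90"):
    """True when the installed version differs from the pinned one (or is absent)."""
    return installed_version(package) != pinned


if __name__ == "__main__":
    print("installed xgboost version:", installed_version("xgboost"))
    if needs_reinstall():
        print("Consider: pip uninstall xgboost, then pip install xgboost==0.90")
```

The comparison is a plain string match, so a version reported as "0.90.0" would count as different from "0.90"; adjust the pin if your environment reports it that way.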