python, scikit-learn, xgboost, cross-validation

Why such different answers for the xgboost scikit-learn interface?


I am using xgboost for the first time and trying the two different interfaces. First I get the data:

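import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score

# Boston housing data: each record spans two physical lines in the raw file
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Features: all columns of the even rows plus the first two columns of the odd rows
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# Target: third column of the odd rows
y = raw_df.values[1::2, 2]
dmatrix = xgb.DMatrix(data=X, label=y)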

Now the scikit-learn interface:

xgbr = xgb.XGBRegressor(objective='reg:squarederror', seed=20)
print(cross_val_score(xgbr, X, y, cv=5))

This outputs:

[0.73438184 0.84902986 0.82579692 0.52374618 0.29743001]

Now the xgboost native interface:

dmatrix = xgb.DMatrix(data=X, label=y)
params={'objective':'reg:squarederror'}
cv_results =  xgb.cv(dtrain=dmatrix, params=params, nfold=5, metrics={'rmse'},  seed=20)
print('RMSE: %.2f' % cv_results['test-rmse-mean'].min())

This gives 3.50.

Why are the outputs so different? What am I doing wrong?


Solution

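  • First of all, you didn't specify a metric in cross_val_score, so you are not calculating RMSE at all. You are getting the estimator's default score, which for a regressor is R² (the coefficient of determination), not its loss function. Specify the scoring explicitly to get comparable results. Note that this scorer returns the negated RMSE, because scikit-learn scorers follow a higher-is-better convention:

    cross_val_score(xgbr, X, y, cv=5, scoring='neg_root_mean_squared_error')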
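To make the difference between the two metrics concrete, here is a minimal pure-Python sketch. The helper names `r_squared` and `rmse` and the toy numbers are made up for illustration and are not part of either library; the formulas are the standard definitions of R² and RMSE.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (unitless, 1.0 is perfect)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error (same units as the target, 0.0 is perfect)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Hypothetical targets and predictions, chosen only for illustration
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print("R^2 :", r_squared(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

The same predictions give a number just below 1 for R² and a number in target units for RMSE, which is why values like 0.73 and 3.50 cannot be compared directly.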
    

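    Second, you need to match sklearn's CV procedure exactly. With cv=5 and a regressor, cross_val_score uses an unshuffled KFold, whereas xgb.cv shuffles the rows by default when it builds its own folds from nfold. To use identical splits, pass the folds argument to xgb.cv:

    from sklearn.model_selection import KFold

    cv_results = xgb.cv(dtrain=dmatrix, params=params, metrics={'rmse'}, folds=KFold(n_splits=5))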
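As a rough illustration of why the split matters, the sketch below builds fold indices in two ways. `contiguous_folds` mimics how an unshuffled K-fold assigns consecutive blocks of rows; `shuffled_folds` is a hypothetical stand-in for any shuffled split and does not reproduce XGBoost's actual random number generator. Both function names are invented for this example.

```python
import random

def contiguous_folds(n_samples, n_splits):
    """Unshuffled K-fold test indices: consecutive blocks of rows.

    The first n_samples % n_splits folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def shuffled_folds(n_samples, n_splits, seed):
    """Illustrative shuffled split: permute the rows, then deal them into folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::n_splits]) for i in range(n_splits)]

# The Boston housing data has 506 rows
plain = contiguous_folds(506, 5)
mixed = shuffled_folds(506, 5, seed=20)

print([len(f) for f in plain])          # fold sizes of the unshuffled split
print(plain[0][:5], "vs", mixed[0][:5])  # first test indices of fold 0 in each scheme
```

If the rows of a dataset are ordered in any meaningful way, contiguous folds can be much harder to predict than shuffled ones, so the two schemes can produce noticeably different scores even with the same model.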
    

    Finally, you need to make sure the number of boosting rounds matches. xgb.cv runs only 10 rounds by default (num_boost_round=10), whereas XGBRegressor builds 100 trees by default (n_estimators=100), and 10 rounds is too few to converge on your dataset. In the Python API the argument is called num_boost_round (nrounds is the name used by the R package). I found that 100 rounds work just fine on this dataset:

    cv_results = xgb.cv(dtrain=dmatrix, params=params, metrics={'rmse'}, folds=KFold(n_splits=5), num_boost_round=100)
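To see why the number of rounds matters so much, here is a toy pure-Python sketch of gradient boosting with one-split trees (stumps) and squared error. It is not XGBoost's algorithm (no regularisation, no second-order terms), and the data, `fit_stump`, and `boosted_rmse` are all invented for illustration; only the shrinkage value 0.3 is chosen to mirror XGBoost's default learning rate. It reports training error rather than cross-validated error.

```python
import math

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Candidate thresholds exclude the largest x so both sides stay non-empty
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boosted_rmse(xs, ys, rounds, eta=0.3):
    """Training RMSE after `rounds` boosting steps, starting from a zero prediction."""
    preds = [0.0] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        # Each stump only corrects a fraction eta of the remaining residual
        preds = [p + eta * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

# Hypothetical one-dimensional data
xs = [i / 10 for i in range(50)]
ys = [10 * math.sin(x) + 20 for x in xs]

rmse_10 = boosted_rmse(xs, ys, 10)
rmse_100 = boosted_rmse(xs, ys, 100)
print("training RMSE after 10 rounds :", rmse_10)
print("training RMSE after 100 rounds:", rmse_100)
```

Because every round removes only part of the remaining error, a model stopped after 10 rounds is still far from its fitted state, which is the situation xgb.cv is in with its default settings.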
    

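    Now the results should closely match, provided you compare like with like: take the final row of cv_results['test-rmse-mean'] rather than its minimum over rounds (the scikit-learn model uses all of its trees), and flip the sign of the cross_val_score values before averaging them.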

    On a side note, it's interesting how you say it's your first time using XGBoost, but you actually have a question on XGBoost dating back to 2017.