
xgboost gives negative R2


I am just trying a basic example on the Boston dataset. A negative R2 means the model performs worse than simply predicting the mean, so I wonder whether I am doing something wrong, or how it can perform so badly in sample. How do I fix this?

import numpy as np
import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import mean_squared_error

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                          learning_rate=0.1, max_depth=5, alpha=10,
                          n_estimators=10)
xg_reg.fit(X_train, y_train)
y_train_hat = xg_reg.predict(X_train)
train_r2 = metrics.r2_score(y_true=y_train, y_pred=y_train_hat)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_hat))
print(train_r2, train_rmse, y_train.mean(), y_train_hat.mean())

-0.11469938481461228 10.091020035258527 22.59630606860158 14.59753
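For anyone reproducing this end to end: the snippet above assumes X_train and y_train already exist. A minimal setup might look like the sketch below; note that load_boston was removed in scikit-learn 1.2, so this pulls the Boston data from OpenML instead, and the split ratio and random_state are my assumptions, not part of the original question:

    # Assumed data setup, not shown in the original question.
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split

    # load_boston was removed in scikit-learn 1.2; fetch the data from OpenML.
    boston = fetch_openml(name="boston", version=1, as_frame=True)
    X = boston.data.astype(float)   # CHAS/RAD come back as categoricals
    y = boston.target.astype(float)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)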

Using a random forest, for example, I was able to get R2=94% in sample and 76% out of sample, so I know I am doing something wrong with xgboost.
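A rough version of that baseline for comparison; the question doesn't show the random forest configuration, so this sketch just uses scikit-learn defaults:

    # Illustrative random-forest baseline; the asker's exact settings are unknown.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    rf = RandomForestRegressor(random_state=0)
    rf.fit(X_train, y_train)
    print("in-sample R2:    ", r2_score(y_train, rf.predict(X_train)))
    print("out-of-sample R2:", r2_score(y_test, rf.predict(X_test)))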


Solution

  • You have set the n_estimators value to 10, which is very small. The default value is 100. With only 10 boosting rounds at learning_rate=0.1, plus strong L1 regularization from alpha=10, the ensemble never gets close to the target: your predicted mean is 14.6 against an actual mean of 22.6, and that bias alone is enough to push the in-sample R2 below zero.

    XGBoost's default settings are usually strong enough to give you a good result, so you don't need to pick the parameters by hand. Just do

     xgb.XGBRegressor()
    

    A better way to choose the parameters is hyperparameter tuning, which you can do with a grid search (see the sketch below).

    After hyperparameter tuning, I found the best values to be n_estimators = 1000 with max_depth = 4.
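    A minimal grid-search sketch along those lines, using scikit-learn's GridSearchCV; the grid values, scoring metric, and cv=5 are illustrative assumptions, not the exact search run here:

        # Hedged sketch: the grid values, scoring and cv are assumptions.
        import xgboost as xgb
        from sklearn.model_selection import GridSearchCV

        param_grid = {
            'n_estimators': [100, 500, 1000],
            'max_depth': [3, 4, 5],
        }
        search = GridSearchCV(xgb.XGBRegressor(objective='reg:squarederror'),
                              param_grid, scoring='r2', cv=5)
        search.fit(X_train, y_train)
        print(search.best_params_)  # the answer reports n_estimators=1000, max_depth=4
        print(search.best_score_)   # mean cross-validated R2 of the best model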