I am just trying a basic example on the Boston housing dataset. A negative R2 means the model is performing worse than simply predicting the mean, so I wonder if I am doing something wrong, or how it can perform so badly in sample. How do I fix this?
import numpy as np
import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import mean_squared_error

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                          learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)
xg_reg.fit(X_train, y_train)

# Evaluate on the training data itself
y_train_hat = xg_reg.predict(X_train)
train_r2 = metrics.r2_score(y_true=y_train, y_pred=y_train_hat)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_hat))
print(train_r2, train_rmse, y_train.mean(), y_train_hat.mean())
-0.11469938481461228 10.091020035258527 22.59630606860158 14.59753
Using random forest, for example, I was able to get R2 = 94% in sample and 76% out of sample, so I know I am doing something wrong with xgboost.
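For reference, this is roughly the random forest baseline I am comparing against (a minimal sketch, assuming scikit-learn's RandomForestRegressor with default settings and the same X_train/X_test split; the exact variable names are assumptions):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Assumed baseline: default random forest on the same split
rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
print("in-sample R2:    ", r2_score(y_train, rf.predict(X_train)))
print("out-of-sample R2:", r2_score(y_test, rf.predict(X_test)))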
You have set n_estimators to 10, which is very small; the default value is 100. With only 10 boosting rounds at learning_rate=0.1, plus a large alpha=10 regularization term, the prediction barely moves away from the initial base score, which is why y_train_hat.mean() is only about 14.6 against a target mean of about 22.6 and the in-sample R2 comes out negative.

The default settings of xgboost are usually strong enough to give you a good result, so you don't need to pick the parameters by hand. Just do

xgb.XGBRegressor()

A better way to choose the parameters is through hyperparameter tuning, which you can do with a grid search.
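For example, a minimal grid search sketch using scikit-learn's GridSearchCV (the parameter grid here is only an illustrative assumption, not a recommendation):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Illustrative grid; adjust the values to your problem
param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    estimator=xgb.XGBRegressor(objective="reg:squarederror"),
    param_grid=param_grid,
    scoring="r2",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)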
Well, after hyperparameter tuning I found the best values to be n_estimators = 1000 with max_depth = 4.
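So the final fit would look roughly like this (a sketch, assuming the same train/test split as above):

import xgboost as xgb
from sklearn.metrics import r2_score

# Refit with the tuned values found by the grid search
best_reg = xgb.XGBRegressor(objective="reg:squarederror",
                            n_estimators=1000, max_depth=4)
best_reg.fit(X_train, y_train)
print("in-sample R2:    ", r2_score(y_train, best_reg.predict(X_train)))
print("out-of-sample R2:", r2_score(y_test, best_reg.predict(X_test)))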