A dataset containing various features and a regression target (called qval) was used to train an XGBoost regressor. This target, qval, lies between 0 and 1 and has the following distribution:
So far, so good. However, when I save the model with xgb.save_model(), re-load it with xgb.load_model(), and predict qval on another dataset, the predicted qval falls outside the [0, 1] boundary, as shown here.
Could someone explain whether this is normal and, if so, why it happens? My guess is that the "equation" (probably the wrong word) that computes qval was fitted to the training data, and its weights don't enforce the [0, 1] boundary in any way, so applying those weights to new data can produce out-of-bounds results. I'm not entirely sure, though.
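Roughly, the save-and-reload workflow looks like this (the file name and the X_train / y_train / X_new variables are just placeholders for my data):

import xgboost as xgb

# train on the original dataset; y_train is qval, all values in [0, 1]
model = xgb.XGBRegressor()
model.fit(X_train, y_train)
model.save_model("qval_model.json")

# later, in another session, reload and predict on a different dataset
loaded = xgb.XGBRegressor()
loaded.load_model("qval_model.json")
preds = loaded.predict(X_new)  # some of these fall outside [0, 1]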
Yes, this is normal: XGBoost (and gradient boosting in general) can make predictions outside the range of the training labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor

# binary 0/1 labels used as a regression target, so the training range is [0, 1]
X, y = make_classification(random_state=42)

gbm = GradientBoostingRegressor(max_depth=1,
                                n_estimators=10,
                                learning_rate=1,
                                random_state=42)
gbm.fit(X, y)
preds = gbm.predict(X)

# the fitted model already predicts outside [0, 1] on the training data itself
print(preds.min(), preds.max())
# Output
# -0.010418732339562916 1.134566081403055
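The same thing can be reproduced with xgboost itself. A sketch with the same synthetic data (the exact minimum and maximum will differ from the numbers above):

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(random_state=42)  # same 0/1 target as above

xgb_model = xgb.XGBRegressor(max_depth=1,
                             n_estimators=10,
                             learning_rate=1,
                             random_state=42)
xgb_model.fit(X, y)
xgb_preds = xgb_model.predict(X)

# with the default, unconstrained squared-error objective the predictions
# can again fall outside the [0, 1] label range
print(xgb_preds.min(), xgb_preds.max())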
If you only notice it on the new dataset, it probably means that its feature distribution differs from that of your training set.
For random forests and single decision trees this does not happen: their predictions are averages of training labels in the leaves, so they stay within the label range. The phenomenon comes from how boosting builds its ensemble: the prediction is an initial value plus a sum of many trees' corrections (scaled by the learning rate), and nothing constrains that sum to the range of the labels.
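For contrast, a sketch with a random forest on the same data; its predictions are averages of training labels inside the leaves, so they cannot leave the [0, 1] range:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

X, y = make_classification(random_state=42)  # same 0/1 target as above

rf = RandomForestRegressor(n_estimators=10, random_state=42)
rf.fit(X, y)
rf_preds = rf.predict(X)

# every prediction is an average of 0/1 training labels,
# so the minimum and maximum stay within [0, 1]
print(rf_preds.min(), rf_preds.max())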