Tags: python, machine-learning, xgboost

Why does XGBoost with datasets of zeros return a non-zero prediction?


I recently developed a fully functioning random forest regression program using scikit-learn's RandomForestRegressor, and now I'm interested in comparing its performance with other libraries. I found that XGBoost offers a scikit-learn API for random forest regression (XGBRFRegressor), so I wrote a small test using a feature matrix X and a target y consisting entirely of zeros.

from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor


tree_number = 100
depth = 10
jobs = 1
dimension = 19
sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                               n_jobs=jobs)
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                         n_jobs=jobs)
dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])

sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)
sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))

Surprisingly, the xgb_VAL model's prediction for an all-zero input sample is non-zero:

sk_prediction = [0.]
xgb_prediction = [0.02500369]

What is wrong with my evaluation, or with how I constructed the comparison, that produces this result?


Solution

  • It seems that XGBoost includes a global bias in the model, and that this bias is fixed at 0.5 rather than estimated from the training data. This has been raised as an issue in the XGBoost GitHub repository (see https://github.com/dmlc/xgboost/issues/799). The corresponding hyperparameter is base_score; if you set it to zero, your model will predict zero as expected.

    from numpy import array
    from xgboost import XGBRFRegressor
    from sklearn.ensemble import RandomForestRegressor
    
    tree_number = 100
    depth = 10
    jobs = 1
    dimension = 19
    
    sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42, n_jobs=jobs)
    xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, base_score=0, random_state=42, n_jobs=jobs)
    
    dataset = array([[0.0] * dimension, [0.0] * dimension])
    y_val = array([0.0, 0.0])
    
    sk_VAL.fit(dataset, y_val)
    xgb_VAL.fit(dataset, y_val)
    
    sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
    xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
    
    print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
    #sk_prediction = [0.]
    #xgb_prediction = [0.]
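
To get a feel for where the small non-zero value comes from, here is a back-of-the-envelope sketch using XGBoost's regularized optimal leaf weight, w* = -G / (H + lambda). The numbers below assume squared-error loss, the default reg_lambda of 1, a single-leaf tree, and the two training rows from the question; the actual output (0.025...) differs because XGBRFRegressor also subsamples rows and columns per tree, but the mechanism is the same: the trees cannot fully cancel the fixed 0.5 bias.

```python
# Sketch: with all-zero targets and the default base_score = 0.5, each
# tree's leaf gets the regularized weight w* = -G / (H + lambda), which
# only partially offsets the 0.5 global bias.
base_score = 0.5
n_rows = 2
lam = 1.0  # reg_lambda default in XGBoost

# For squared-error loss: gradient = prediction - target, hessian = 1 per row.
G = n_rows * (base_score - 0.0)  # sum of gradients
H = n_rows * 1.0                 # sum of hessians
leaf = -G / (H + lam)            # regularized optimal leaf weight

prediction = base_score + leaf
print(prediction)  # 0.1667 - small but non-zero, because lam > 0
```

With base_score=0 the gradients are all zero, so every leaf weight is zero and the prediction is exactly zero.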