Tags: python, machine-learning, xgboost

Summing the leaf values of XGBRegressor trees does not match the prediction


It was my understanding that the final prediction of an XGBoost model (in this particular case an XGBRegressor) is obtained by summing the values of the predicted leaves [1] [2]. Yet when I sum those values myself, the result does not match the prediction. Here is an MRE:

import json
from collections import deque

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import xgboost as xgb


def leafs_vector(tree):
    """Yield one value per node in breadth-first order (assumed to match node id order); non-leaf nodes yield 0."""

    stack = deque([tree])

    while stack:
        node = stack.popleft()
        if "leaf" in node:
            yield node["leaf"]
        else:
            yield 0
            for child in node["children"]:
                stack.append(child)


# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the XGBoost regressor model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror',
                          max_depth=5,
                          n_estimators=10)

# Train the model
xg_reg.fit(X_train, y_train)

# Compute the original predictions
y_pred = xg_reg.predict(X_test)

# get the index of each predicted leaf
predicted_leafs_indices = xg_reg.get_booster().predict(xgb.DMatrix(X_test), pred_leaf=True).astype(np.int32)

# get the trees
trees = xg_reg.get_booster().get_dump(dump_format="json")
trees = [json.loads(tree) for tree in trees]

# get a vector of nodes (ordered by node id)
leafs = [list(leafs_vector(tree)) for tree in trees]

l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)))

assert np.allclose(np.array(l_pred), y_pred, atol=0.5) # fails

I also tried adding the default base_score value (0.5, as written here) to the total sum, but that didn't work either.

l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)) + 0.5) 

Solution

  • The problem is that even if the model's base_score parameter is None, the trained booster can still have a base_score, and it can differ from the default one [1].

    In addition, the model.base_score will continue to be None as discussed in #8634. In summary, the Python attribute base_score is a user parameter according to the sklearn interface and should not be changed by the library itself. To see the configured base score, you either have to use the

    To access the base_score value, the following works in version 2.0.3 of XGBoost (here model is the fitted XGBRegressor):

    import json

    config = json.loads(model.get_booster().save_config())
    base_score = float(config["learner"]["learner_model_param"]["base_score"])
    

    Adding this base_score to the sum of the leaf values makes it match the predicted value.
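To make the last step concrete, here is a minimal sketch of the summation with the base score included. It assumes that each node in the JSON dump carries a "nodeid" key, so leaf values can be looked up by id rather than by traversal order. It also assumes an objective such as reg:squarederror, where the base score is added directly to the sum of the leaves. The helper names (leaf_values_by_id, sum_leaves) and the toy trees are hypothetical and only stand in for the real dump.

```python
import json


def leaf_values_by_id(tree):
    """Map node id -> leaf value for one parsed JSON tree dump."""
    values = {}
    pending = [tree]
    while pending:
        node = pending.pop()
        if "leaf" in node:
            values[node["nodeid"]] = node["leaf"]
        else:
            pending.extend(node["children"])
    return values


def sum_leaves(trees, leaf_indices, base_score):
    """Rebuild predictions: base_score plus the value of each predicted leaf."""
    lookups = [leaf_values_by_id(tree) for tree in trees]
    return [
        base_score + sum(lookup[int(node_id)] for lookup, node_id in zip(lookups, row))
        for row in leaf_indices
    ]


# Toy stand-ins for get_dump(dump_format="json") output (hypothetical values).
toy_dump = [
    '{"nodeid": 0, "split": "f0", "children": ['
    '{"nodeid": 1, "leaf": -1.5}, {"nodeid": 2, "leaf": 2.0}]}',
    '{"nodeid": 0, "leaf": 0.25}',
]
toy_trees = [json.loads(t) for t in toy_dump]

# One row per sample, one column per tree, as returned with pred_leaf=True.
toy_leaf_indices = [[1, 0], [2, 0]]

toy_predictions = sum_leaves(toy_trees, toy_leaf_indices, base_score=150.0)
print(toy_predictions)
```

With the variables from the question, the equivalent call would be sum_leaves(trees, predicted_leafs_indices, base_score), and the result can then be compared with y_pred using np.allclose. Small differences may remain because XGBoost accumulates predictions in 32-bit floats.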