Tags: python, machine-learning, xgboost

Manipulation and interpretation of xgboost models in Python


I've been looking at the output of booster.save_model("model.json"), and I'm having trouble understanding it. It seems that almost none of the information in model.json is actually used for prediction; suspiciously little of it, in fact. For reference, one such model.json looks like this:

j={"learner": {
     "attributes": {},
     "feature_names": [],
     "feature_types": [],
     "gradient_booster": {"model": {"gbtree_model_param": {"num_trees": "1", "size_leaf_vector": "0"}, "tree_info": [0], "trees": [<a single tree>]}, "name": "gbtree"},
     "learner_model_param": {"base_score": "5E-1", "num_class": "0", "num_feature": "5"},
     "objective": {"name": "reg:squarederror", "reg_loss_param": {"scale_pos_weight": "1"}}},
   "version": [1, 4, 2]}

where the single tree under j['learner']['gradient_booster']['model']['trees'] is

{
 "base_weights": [-0.4984156, -1.2707391, 0.37819964, -2.128702, -0.5379327, -0.41528815, 1.2452325, -2.9461422, -1.3161767, -1.317807, 0.3579243, -1.2447615, 0.33945537, 0.5203166, 2.272548],
 "categories": [],
 "categories_nodes": [],
 "categories_segments": [],
 "categories_sizes": [],
 "default_left": [true, true, true, true, true, true, true, false, false, false, false, false, false, false, false],
 "id": 0,
 "left_children": [1, 3, 5, 7, 9, 11, 13, -1, -1, -1, -1, -1, -1, -1, -1],
 "loss_changes": [6771.463, 3341.7627, 3223.7031, 1622.7256, 2004.9153, 1532.3413, 1666.2395, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 "parents": [2147483647, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
 "right_children": [2, 4, 6, 8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1],
 "split_conditions": [0.073486, -0.11132032, 0.041045856, -0.011401389, 0.104938895, -0.05693599, 0.19832665, -0.8838427, -0.39485303, -0.3953421, 0.1073773, -0.37342846, 0.101836614, 0.15609498, 0.6817644],
 "split_indices": [3, 4, 2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 "split_type": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 "sum_hessian": [10000.0, 5316.0, 4684.0, 2448.0, 2868.0, 2446.0, 2238.0, 1219.0, 1229.0, 1533.0, 1335.0, 1165.0, 1281.0, 1313.0, 925.0],
 "tree_param": {"num_deleted": "0", "num_feature": "5", "num_nodes": "15", "size_leaf_vector": "0"}
}

Question 1: What is the exact formula for the prediction that a booster makes, as a function of its inputs and these parameters?

I would have expected the prediction to be formed by starting with the base_score and adding the relevant values of base_weights during each traversal, but that doesn't seem to be the case. Indeed, it appears that the predictions do not depend on base_weights at all (or on loss_changes or sum_hessian)! Here's a brief demonstration (with xgboost.__version__ == 1.4.2 and Python 3.9.7):

import numpy as np, xgboost, json
def new_model():
    return xgboost.XGBRegressor(n_estimators=1, max_depth=3, base_score=0.5)
def save_model(model, path):
    model.get_booster().save_model(path)
def load_model(path):
    model = new_model()
    model.load_model(path)
    return model

x = np.random.standard_normal((10000, 5))
y = x.sum(1)

m0 = new_model()
m0.fit(x, y)
pred0 = m0.predict(x)
p0 = '/tmp/m0.json'
save_model(m0, p0)
np.testing.assert_array_equal(pred0, load_model(p0).predict(x))  # test save->load

with open(p0) as f:
    j = json.load(f)
trees = j['learner']['gradient_booster']['model']['trees']
# overwrite these fields with random values; if they affected predictions, the assertion below would fail
for field in ['base_weights', 'loss_changes', 'sum_hessian']:
    trees[0][field] = np.random.random(len(trees[0][field])).tolist()
p1 = '/tmp/m2.json'
with open(p1, 'w') as f:
    json.dump(j, f)
np.testing.assert_array_equal(pred0, load_model(p1).predict(x))  # this assertion passes! Unexpected!

Indeed, the only floating-point data that actually seems to be used is split_conditions, but I would have thought that was nowhere near enough to describe a regression tree. So if question 1 is too granular to answer here, there's still...

Question 2: how is it possible that the model predictions depend only on this one floating point vector, split_conditions?


Solution

  • (I see it's nine months too late, but here's a rudimentary answer as other people may be interested in this...)

    split_indices gives, for each node (the position in the array), the 0-based index of the feature to split on, counting within the list of features supplied during training. It basically says "at this node, use feature N for splitting".

    For split nodes, split_conditions holds the splitting threshold: if the feature value is < the split_condition, go left; if it is >=, go right. Missing values (NAs) are handled via default_left, which tells you which branch they take at each split.

    In your example the first split would be based on feature #3 at threshold 0.073486.

    For leaf nodes, split_conditions contains the leaf value, i.e. the prediction for observations falling into that leaf (with possible caveats depending on the type of problem, transformations, etc.). left_children and right_children have a value of -1 for leaf nodes. So for your reg:squarederror model, the prediction in question 1 is base_score plus, for each tree, the leaf value the observation lands in; with a single tree, that's base_score plus one leaf value. (A minimal traversal sketch at the end of this answer puts these pieces together.)

    Hope this helps someone get started -- there are quite a few other details. Some of the info in the JSON is not needed for prediction, but it allows you to calculate e.g. the feature importance metrics and to see how the tree was constructed.

    Finally, plotting the tree (xgboost.to_graphviz(booster=m0)) helps me a lot in interpreting the info in the JSON. A depth=1 example (a single splitting node and 2 leaf nodes) would be even easier to interpret.
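
    Putting these pieces together, here is a minimal sketch (not xgboost's own code) of the prediction formula from question 1: walk each tree from the root using split_indices / split_conditions / default_left, read the leaf value out of split_conditions, and add everything to base_score. It assumes numeric-only splits and the reg:squarederror identity link; predict_from_json is just an illustrative name.

    import json
    import numpy as np

    def predict_from_json(model_json, X):
        """Reproduce booster predictions by walking the trees in the saved JSON."""
        learner = model_json["learner"]
        base_score = float(learner["learner_model_param"]["base_score"])
        trees = learner["gradient_booster"]["model"]["trees"]

        def leaf_value(tree, x):
            node = 0
            while tree["left_children"][node] != -1:       # -1 marks a leaf
                feature = tree["split_indices"][node]
                threshold = tree["split_conditions"][node]
                if np.isnan(x[feature]):                   # missing values follow default_left
                    go_left = tree["default_left"][node]
                else:
                    go_left = x[feature] < threshold       # < goes left, >= goes right
                node = tree["left_children"][node] if go_left else tree["right_children"][node]
            return tree["split_conditions"][node]          # for leaves, split_conditions holds the leaf value

        return np.array([base_score + sum(leaf_value(t, x) for t in trees) for x in X])

    # Usage with the objects from the question (p0 is the path the model was saved to):
    # with open(p0) as f:
    #     j = json.load(f)
    # np.testing.assert_allclose(predict_from_json(j, x), m0.predict(x), rtol=1e-5)

    The small rtol is there because the booster does its arithmetic in float32, so an exact match should not be expected.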