I've been taking a look at the output of booster.save_model("model.json"), and I am having trouble understanding it. It seems as though almost none of the information in model.json is actually used for prediction; in fact, suspiciously little of it is. For reference, one such model.json looks like this:
j={"learner": {
"attributes": {},
"feature_names": [],
"feature_types": [],
"gradient_booster": {"model": {"gbtree_model_param": {"num_trees": "1", "size_leaf_vector": "0"}, "tree_info": [0], "trees": [<a single tree>]}, "name": "gbtree"},
"learner_model_param": {"base_score": "5E-1", "num_class": "0", "num_feature": "5"},
"objective": {"name": "reg:squarederror", "reg_loss_param": {"scale_pos_weight": "1"}}},
"version": [1, 4, 2]}
where the single tree under j['learner']['gradient_booster']['model']['trees'] is
{
"base_weights": [-0.4984156, -1.2707391, 0.37819964, -2.128702, -0.5379327, -0.41528815, 1.2452325, -2.9461422, -1.3161767, -1.317807, 0.3579243, -1.2447615, 0.33945537, 0.5203166, 2.272548],
"categories": [],
"categories_nodes": [],
"categories_segments": [],
"categories_sizes": [],
"default_left": [true, true, true, true, true, true, true, false, false, false, false, false, false, false, false],
"id": 0,
"left_children": [1, 3, 5, 7, 9, 11, 13, -1, -1, -1, -1, -1, -1, -1, -1],
"loss_changes": [6771.463, 3341.7627, 3223.7031, 1622.7256, 2004.9153, 1532.3413, 1666.2395, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
"parents": [2147483647, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
"right_children": [2, 4, 6, 8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1],
"split_conditions": [0.073486, -0.11132032, 0.041045856, -0.011401389, 0.104938895, -0.05693599, 0.19832665, -0.8838427, -0.39485303, -0.3953421, 0.1073773, -0.37342846, 0.101836614, 0.15609498, 0.6817644],
"split_indices": [3, 4, 2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
"split_type": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"sum_hessian": [10000.0, 5316.0, 4684.0, 2448.0, 2868.0, 2446.0, 2238.0, 1219.0, 1229.0, 1533.0, 1335.0, 1165.0, 1281.0, 1313.0, 925.0],
"tree_param": {"num_deleted": "0", "num_feature": "5", "num_nodes": "15", "size_leaf_vector": "0"}
}
Question 1: What is the exact formula for the prediction that a booster makes, as a function of its inputs and these parameters?
I would have expected the prediction to be formed by starting with the base_score and adding the relevant values of base_weights during each traversal, but that doesn't seem to be the case. Indeed, it appears that the predictions do not depend on base_weights (or on loss_changes or sum_hessian) at all! Here's a brief demonstration (with xgboost.__version__=1.4.2 and Python 3.9.7):
import numpy as np, xgboost, json

def new_model():
    return xgboost.XGBRegressor(n_estimators=1, max_depth=3, base_score=0.5)

def save_model(model, path):
    model.get_booster().save_model(path)

def load_model(path):
    model = new_model()
    model.load_model(path)
    return model

x = np.random.standard_normal((10000, 5))
y = x.sum(1)

m0 = new_model()
m0.fit(x, y)
pred0 = m0.predict(x)

p0 = '/tmp/m0.json'
save_model(m0, p0)
np.testing.assert_array_equal(pred0, load_model(p0).predict(x))  # test save->load

# Overwrite three of the per-node float arrays with random junk...
with open(p0) as f:
    j = json.load(f)
trees = j['learner']['gradient_booster']['model']['trees']
for field in ['base_weights', 'loss_changes', 'sum_hessian']:
    trees[0][field] = np.random.random(len(trees[0][field])).tolist()

# ...and the predictions are unchanged.
p1 = '/tmp/m2.json'
with open(p1, 'w') as f:
    json.dump(j, f)
np.testing.assert_array_equal(pred0, load_model(p1).predict(x))  # this assertion passes! Unexpected!
Indeed, the only floating-point data that seems to be used is split_conditions, but I would have thought that was nowhere near enough data to describe a regression tree. So if Question 1 is too granular to answer here, there's still...

Question 2: How is it possible that the model predictions depend only on this one floating-point vector, split_conditions?
(I see it's nine months too late, but here's a rudimentary answer as other people may be interested in this...)
split_indices refers to the 0-based index into the list of features supplied during training. It basically says "at this node (position in the array), use feature N for splitting".

For split nodes, split_conditions holds the threshold for splitting: if feature < split_condition, go left; if >=, go right. Missing values (NAs) are handled via default_left, which tells you which branch they take at each split.
In your example the first split would be based on feature #3 at threshold 0.073486.
For leaf nodes, the split_conditions entry contains the leaf value, i.e. the prediction for observations falling into that leaf (with possible caveats depending on the type of problem, transformations, etc.). left_children and right_children have a value of -1 for the leaf nodes.
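Putting that together answers Question 1 for this model. Here is a minimal sketch (my own illustration, not the library's actual code path) of reproducing a prediction from the JSON alone; it assumes reg:squarederror, where base_score is used as-is and the margin needs no final transformation:

import json, math

def predict_one(row, model_json):
    learner = model_json['learner']
    out = float(learner['learner_model_param']['base_score'])
    for tree in learner['gradient_booster']['model']['trees']:
        node = 0
        # Internal nodes have real children; leaves have -1.
        while tree['left_children'][node] != -1:
            value = row[tree['split_indices'][node]]
            if math.isnan(value):
                go_left = tree['default_left'][node]  # missing values follow the default branch
            else:
                go_left = value < tree['split_conditions'][node]
            node = tree['left_children'][node] if go_left else tree['right_children'][node]
        out += tree['split_conditions'][node]  # at a leaf, split_conditions holds the leaf value
    return out

Checking it against your demonstration (reusing p0, x and pred0; allclose rather than exact equality because XGBoost computes in float32 while Python works in float64):

with open(p0) as f:
    j0 = json.load(f)
manual = np.array([predict_one(row, j0) for row in x])
np.testing.assert_allclose(manual, pred0, rtol=1e-5)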
Hope this helps someone get started -- there are quite a few other details. Some of the info in the JSON is not needed for prediction but lets you reconstruct, for example, the feature importance metrics and how the tree was constructed.
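For instance, gain-type importance can be recovered from loss_changes and split_indices, two of the fields prediction never touches. A rough sketch (my own; I believe 'total_gain' importance sums the loss reduction of every split per feature, using the unmodified JSON from before those fields were randomised):

from collections import defaultdict

with open(p0) as f:
    model_json = json.load(f)
total_gain = defaultdict(float)
for tree in model_json['learner']['gradient_booster']['model']['trees']:
    for node, left in enumerate(tree['left_children']):
        if left != -1:  # only split nodes carry a loss reduction
            total_gain[tree['split_indices'][node]] += tree['loss_changes'][node]
print(dict(total_gain))  # compare with m0.get_booster().get_score(importance_type='total_gain')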
Finally, for me, plotting the tree (xgboost.to_graphviz(booster=m0)) helps a lot in interpreting the info in the JSON.
An example with depth=1 (a single split node and two leaf nodes) is even easier to interpret.
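Something along these lines (a sketch; the file name is arbitrary):

m1 = xgboost.XGBRegressor(n_estimators=1, max_depth=1, base_score=0.5)
m1.fit(x, y)
m1.get_booster().save_model('/tmp/depth1.json')
with open('/tmp/depth1.json') as f:
    t = json.load(f)['learner']['gradient_booster']['model']['trees'][0]
print(t['split_indices'])     # e.g. [k, 0, 0]: the root splits on feature k; leaf entries pad with 0
print(t['split_conditions'])  # [threshold, left leaf value, right leaf value]
print(t['left_children'])     # [1, -1, -1]: nodes 1 and 2 are the two leaves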