
Internal node predictions of xgboost model


Is it possible to calculate the internal node predictions of an xgboost model? The R package, gbm, provides a prediction for internal nodes of each tree.

The xgboost output, however, only shows predictions for the final leaves of the model.

xgboost output:

Notice that the Quality column has the final prediction for the leaf node in row 6. I would like that value for each of the internal nodes as well.

   Tree Node  ID    Feature    Split  Yes   No Missing     Quality  Cover
1:    0    0 0-0 Sex=female  0.50000  0-1  0-2     0-1 246.6042790 222.75
2:    0    1 0-1        Age 13.00000  0-3  0-4     0-4  22.3424225 144.25
3:    0    2 0-2   Pclass=3  0.50000  0-5  0-6     0-5  60.1275253  78.50
4:    0    3 0-3      SibSp  2.50000  0-7  0-8     0-7  23.6302433   9.25
5:    0    4 0-4       Fare 26.26875  0-9 0-10     0-9  21.4425507 135.00
6:    0    5 0-5       Leaf       NA <NA> <NA>    <NA>   0.1747126  42.50

R gbm output:

In the R gbm package output, the Prediction column contains values for both the leaf nodes (SplitVar == -1) and the internal nodes. I would like access to the same values from the xgboost model.

   SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0         1   0.000000000        1         8          15      32.564591    445  0.001132514
1         2   9.500000000        2         3           7       3.844470    282 -0.085827382
2        -1   0.119585850       -1        -1          -1       0.000000     15  0.119585850
3         0   1.000000000        4         5           6       3.047926    207 -0.092846157
4        -1  -0.118731665       -1        -1          -1       0.000000    165 -0.118731665
5        -1   0.008846912       -1        -1          -1       0.000000     42  0.008846912
6        -1  -0.092846157       -1        -1          -1       0.000000    207 -0.092846157

Question:

How do I access or calculate predictions for the internal nodes of an xgboost model? I would like to use them for a greedy, poor man's version of SHAP scores.


Solution

  • The solution is to dump the xgboost model as JSON with with_stats=True. That adds the cover statistic to every node, which can be used to propagate the leaf values up through the internal nodes:

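    from numpy import float32  # assumed source of float32

    def _calculate_contribution(node: AnyNode) -> float32:
        # AnyNode and Leaf are this answer's own tree classes (not shown here).
        if isinstance(node, Leaf):
            # A leaf carries its value directly.
            return node.contrib
        # Internal node: cover-weighted average of the two children.
        return (
            node.left.cover * _calculate_contribution(node.left)
            + node.right.cover * _calculate_contribution(node.right)
        ) / node.cover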
    

    The internal contribution is the weighted average of the child contributions. Using this method, the generated results exactly match those returned when calling the predict method with pred_contribs=True and approx_contribs=True.
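
The same idea can be sketched directly against a parsed JSON dump, without custom node classes. This is a minimal illustration, not the code from the answer: the function name `internal_values` is hypothetical, and it assumes each node dict carries "nodeid" and "cover", with leaves holding "leaf" and internal nodes holding "children". The tiny tree is hand-written; its covers and leaf values are borrowed from rows 3 to 5 of the gbm table in the question purely as a sanity check, and the split fields are placeholders.

```python
import json
from typing import Any, Dict, Optional


def internal_values(
    node: Dict[str, Any], out: Optional[Dict[int, float]] = None
) -> Dict[int, float]:
    """Map every nodeid in one dumped tree to its cover-weighted value."""
    if out is None:
        out = {}
    if "leaf" in node:
        # Leaf: the value is stored directly in the dump.
        value = float(node["leaf"])
    else:
        # Post-order walk: fill in the children first, then average them.
        for child in node["children"]:
            internal_values(child, out)
        value = sum(
            child["cover"] * out[child["nodeid"]] for child in node["children"]
        ) / node["cover"]
    out[node["nodeid"]] = value
    return out


# Hand-written stand-in for one tree of a JSON dump (placeholder split fields).
tree_json = """
{"nodeid": 0, "split": "f0", "split_condition": 1.0,
 "yes": 1, "no": 2, "missing": 1, "cover": 207.0,
 "children": [
   {"nodeid": 1, "leaf": -0.118731665, "cover": 165.0},
   {"nodeid": 2, "leaf": 0.008846912, "cover": 42.0}
 ]}
"""

values = internal_values(json.loads(tree_json))
print(values[0])  # about -0.092846, the gbm Prediction for node 3
```

With a real model, each string returned by `Booster.get_dump(with_stats=True, dump_format="json")` should be usable in place of `tree_json`, one tree at a time.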