A decision tree splits nodes until some stopping condition is met, and uses the mean of the target values in each node as its prediction.
I would like to get all the values in such a node, not just the mean, in order to perform more complex operations on them. I am using sklearn. I have not found any answers on this, just a way to get the mean of all nodes using DecisionTreeRegressor.tree_.value.
How can I do this?
AFAIK there is no API method for this, but you can certainly get the values programmatically.
Let's make some dummy data and build a regression tree first to demonstrate this:
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# dummy data
rng = np.random.RandomState(1) # for reproducibility
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))
estimator = DecisionTreeRegressor(max_depth=3)
estimator.fit(X, y)
import graphviz
dot_data = export_graphviz(estimator, out_file=None)
graph = graphviz.Source(dot_data)
graph
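As a side note, if graphviz is not available, a rough equivalent using sklearn's own plot_tree (available since scikit-learn 0.21) should produce a similar picture:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# alternative visualization that does not require graphviz
plt.figure(figsize=(12, 6))
plot_tree(estimator)
plt.show()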
Here is a plot of our decision tree:
from which it is apparent that we have 8 leaves, each annotated with its number of samples and its mean value.
The key method here is apply:
on_leaf = estimator.apply(X)
on_leaf
# result:
array([ 3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
       10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 13, 13, 13,
       13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14])
on_leaf has the same length as our data X and outcomes y; it gives the indices of the nodes where each sample ends up (all nodes in on_leaf being terminal nodes, i.e. leaves). The number of its unique values is equal to the number of our leaves, here 8:
len(np.unique(on_leaf))
# 8
and on_leaf[k] gives the index of the node where y[k] ends up.
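If you want to double-check that these ids are indeed leaves, a quick sketch using the fitted tree_ attribute should work; in sklearn's tree representation, a node whose children_left entry is -1 has no children, i.e. it is a leaf:

# sanity check: leaf nodes are exactly those without children
leaf_ids = np.where(estimator.tree_.children_left == -1)[0]
leaf_ids
# this should coincide with np.unique(on_leaf), i.e.
# array([ 3,  4,  6,  7, 10, 11, 13, 14])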
Now we can get the actual y values for each one of the 8 leaves as:
leaves = []
for i in np.unique(on_leaf):
    leaves.append(y[np.argwhere(on_leaf==i)])
len(leaves)
# 8
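Equivalently (this is just an alternative sketch, not required for what follows), a dictionary comprehension with boolean masking keys each group by its leaf id and yields flat 1-D arrays instead of the column vectors produced by argwhere:

# alternative: {leaf_id: 1-D array of the y values in that leaf}
leaf_values = {i: y[on_leaf == i] for i in np.unique(on_leaf)}
len(leaf_values)
# 8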
Let's verify that, in accordance with our plot, the first leaf has only one sample, with a value of -1.149 (since it is a single-sample leaf, the value of the sample is equal to the mean):
leaves[0]
# array([[-1.1493464]])
Looks good. What about the 2nd leaf, with 10 samples and a mean value of 0.173?
leaves[1]
# result:
array([[ 0.09131401],
       [ 0.09668352],
       [ 0.13651039],
       [ 0.19403525],
       [-0.12383814],
       [ 0.26365828],
       [ 0.41252216],
       [ 0.44546446],
       [ 0.47215529],
       [-0.26319138]])
len(leaves[1])
# 10
leaves[1].mean()
# 0.17253138570808904
And so on - a final check for the last leaf (#7), with 4 samples and a mean of -0.99:
leaves[7]
# result:
array([[-0.99994398],
       [-0.99703245],
       [-0.99170146],
       [-0.9732277 ]])
leaves[7].mean()
# -0.9904763973694366
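Rather than eyeballing leaf by leaf, we can also check all 8 leaves in one go against the means that sklearn itself stores in tree_.value (for a single-output regressor, tree_.value[node] holds the node mean; the indexing below assumes that layout):

# compare our per-leaf means with the means stored in the fitted tree
our_means = np.array([leaf.mean() for leaf in leaves])
tree_means = estimator.tree_.value[np.unique(on_leaf), 0, 0]
np.allclose(our_means, tree_means)
# True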
What you need, given data X, outcomes y, and a fitted decision tree regressor estimator, is:
on_leaf = estimator.apply(X)

leaves = []
for i in np.unique(on_leaf):
    leaves.append(y[np.argwhere(on_leaf==i)])
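From there, the "more complex operations" mentioned in the question are straightforward; as a purely illustrative example, here are the per-leaf medians:

# example of a statistic other than the mean, computed per leaf
leaf_medians = {i: np.median(vals)
                for i, vals in zip(np.unique(on_leaf), leaves)}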