A decision tree splits nodes until some stopping condition is met, and uses the mean of the target values in each node as its prediction.
I would like to get all the values in such a node, not just the mean, in order to perform more complex operations on them. I am using sklearn. I have not found any answers on this, just a way to get the mean of all nodes using DecisionTreeRegressor.tree_.value.
How can I do this?
AFAIK there is no API method for this, but you can certainly get the values programmatically.
Let's make some dummy data and build a regression tree first to demonstrate this:
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# dummy data
rng = np.random.RandomState(1) # for reproducibility
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))
estimator = DecisionTreeRegressor(max_depth=3)
estimator.fit(X, y)
import graphviz
dot_data = export_graphviz(estimator, out_file=None)
graph = graphviz.Source(dot_data)
graph
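As a side note, if graphviz is not available, a rough equivalent using sklearn's own plot_tree (available since scikit-learn 0.21) should produce a similar picture:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# alternative visualization that does not require graphviz
plt.figure(figsize=(12, 6))
plot_tree(estimator)
plt.show()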
Here is a plot of our decision tree:
from which it is apparent that we have 8 leaves, each annotated with its number of samples and its mean value.
The key method here is apply:
on_leaf = estimator.apply(X)
on_leaf
# result:
array([ 3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
       10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 13, 13, 13,
       13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14])
on_leaf has the same length as our data X and outcomes y; it gives the indices of the nodes where each sample ends up (all nodes in on_leaf being terminal nodes, i.e. leaves). The number of its unique values is equal to the number of our leaves, here 8:
len(np.unique(on_leaf))
# 8
and on_leaf[k] gives the index of the node where y[k] ends up.
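If you want to double-check that these ids are indeed leaves, a quick sketch using the fitted tree_ attribute should work; in sklearn's tree representation, a node whose children_left entry is -1 has no children, i.e. it is a leaf:

# sanity check: leaf nodes are exactly those without children
leaf_ids = np.where(estimator.tree_.children_left == -1)[0]
leaf_ids
# this should coincide with np.unique(on_leaf), i.e.
# array([ 3,  4,  6,  7, 10, 11, 13, 14])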
Now we can get the actual y values for each one of the 8 leaves as:
leaves = []
for i in np.unique(on_leaf):
    leaves.append(y[np.argwhere(on_leaf==i)])
len(leaves)
# 8
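Equivalently (this is just an alternative sketch, not required for what follows), a dictionary comprehension with boolean masking keys each group by its leaf id and yields flat 1-D arrays instead of the column vectors produced by argwhere:

# alternative: {leaf_id: 1-D array of the y values in that leaf}
leaf_values = {i: y[on_leaf == i] for i in np.unique(on_leaf)}
len(leaf_values)
# 8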
Let's verify that, in accordance with our plot, the first leaf has only one sample, with a value of -1.149 (since it is a single-sample leaf, the value of the sample is equal to the mean):
leaves[0]
# array([[-1.1493464]])
Looks good. What about the 2nd leaf, with 10 samples and a mean value of 0.173?
leaves[1]
# result:
array([[ 0.09131401],
       [ 0.09668352],
       [ 0.13651039],
       [ 0.19403525],
       [-0.12383814],
       [ 0.26365828],
       [ 0.41252216],
       [ 0.44546446],
       [ 0.47215529],
       [-0.26319138]])
len(leaves[1])
# 10
leaves[1].mean()
# 0.17253138570808904
And so on - a final check for the last leaf (#7), with 4 samples and a mean of -0.99:
leaves[7]
# result:
array([[-0.99994398],
       [-0.99703245],
       [-0.99170146],
       [-0.9732277 ]])
leaves[7].mean()
# -0.9904763973694366
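Rather than eyeballing leaf by leaf, we can also check all 8 leaves in one go against the means that sklearn itself stores in tree_.value (for a single-output regressor, tree_.value[node] holds the node mean; the indexing below assumes that layout):

# compare our per-leaf means with the means stored in the fitted tree
our_means = np.array([leaf.mean() for leaf in leaves])
tree_means = estimator.tree_.value[np.unique(on_leaf), 0, 0]
np.allclose(our_means, tree_means)
# True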
What you need, given data X, outcomes y, and a fitted decision tree regressor estimator, is:
on_leaf = estimator.apply(X)

leaves = []
for i in np.unique(on_leaf):
    leaves.append(y[np.argwhere(on_leaf==i)])
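From there, the "more complex operations" mentioned in the question are straightforward; as a purely illustrative example, here are the per-leaf medians:

# example of a statistic other than the mean, computed per leaf
leaf_medians = {i: np.median(vals)
                for i, vals in zip(np.unique(on_leaf), leaves)}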