python, machine-learning, scikit-learn, decision-tree

Get all values of a terminal (leaf) node in a DecisionTreeRegressor


A decision tree splits nodes until some stopping condition is met, and then uses the mean of the target values in each node as its prediction.

I would like to get all the values in such a node, not just the mean, in order to perform more complex operations on them. I am using sklearn. I have not found any answers on this, only a way to get the mean of each node via DecisionTreeRegressor.tree_.value.

How can I do this?
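
(For context, tree_.value only exposes the stored per-node means; a minimal illustration, assuming a fitted regressor reg:)

    reg.tree_.value.shape     # (n_nodes, 1, 1) for a single-output regressor
    reg.tree_.value[3][0][0]  # just the mean of node 3 - not its raw samples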


Solution

  • AFAIK there is no API method for this, but you can certainly get the values programmatically.

    Let's make some dummy data and build a regression tree first to demonstrate this:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_graphviz
    
    # dummy data
    rng = np.random.RandomState(1)  # for reproducibility
    X = np.sort(5 * rng.rand(80, 1), axis=0)
    y = np.sin(X).ravel()
    y[::5] += 3 * (0.5 - rng.rand(16))
    
    estimator = DecisionTreeRegressor(max_depth=3)
    estimator.fit(X, y)
    
    import graphviz 
    dot_data = export_graphviz(estimator, out_file=None) 
    
    graph = graphviz.Source(dot_data) 
    graph
    

    Here is a plot of our decision tree:

    [plot of the fitted regression tree: 8 leaves, each annotated with its sample count and mean value]

    from which it is apparent that we have 8 leaves, with the number of samples and the mean of each one depicted.
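
    (As a side note, if graphviz is not available, scikit-learn's built-in plot_tree should render essentially the same picture; a minimal sketch:)

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree

    plt.figure(figsize=(12, 6))
    plot_tree(estimator)  # same fitted tree, drawn with matplotlib
    plt.show()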

    The key command here is apply:

    on_leaf = estimator.apply(X)
    on_leaf
    # result:
    array([ 3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  6,  6,  6,  6,  6,  6,
            6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
            6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
           10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 13, 13, 13,
           13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14])
    

    on_leaf has a length equal to our data X and outcomes y; it gives the indices of the nodes where each sample ends up (all nodes in on_leaf being terminal nodes, i.e. leaves). The number of its unique values is equal to the number of our leaves, here 8:

    len(np.unique(on_leaf))
    # 8
    

    and on_leaf[k] gives the index of the node where y[k] ends up.
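
    For example, matching the first entry of the array above, the very first sample ends up in leaf node 3:

    on_leaf[0]
    # 3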

    Now we can get the actual y values for each one of the 8 leaves as:

    leaves = []
    for i in np.unique(on_leaf):
      leaves.append(y[np.argwhere(on_leaf==i)]) 
    
    len(leaves)
    # 8
    

    Let's verify that, in accordance with our plot, the first leaf has only one sample, with a value of -1.149 (since it is a single-sample leaf, the value of the sample equals the mean):

    leaves[0]
    # array([[-1.1493464]])
    

    Looks good. What about the 2nd leaf, with 10 samples and a mean value of 0.173?

    leaves[1]
    # result:
    array([[ 0.09131401],
           [ 0.09668352],
           [ 0.13651039],
           [ 0.19403525],
           [-0.12383814],
           [ 0.26365828],
           [ 0.41252216],
           [ 0.44546446],
           [ 0.47215529],
           [-0.26319138]])
    
    len(leaves[1])
    # 10
    
    leaves[1].mean()
    # 0.17253138570808904
    

    And so on - a final check for the last leaf (#7), with 4 samples and a mean of -0.99:

    leaves[7]
    # result:
    array([[-0.99994398],
           [-0.99703245],
           [-0.99170146],
           [-0.9732277 ]])
    
    leaves[7].mean()
    # -0.9904763973694366
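
    As an extra cross-check (not part of the original walkthrough), the per-leaf means computed this way should match the per-node means that scikit-learn itself stores in tree_.value, i.e. the quantity mentioned in the question:

    # assuming tree_.value[node, 0, 0] holds the mean of a single-output regression node
    for k, node in enumerate(np.unique(on_leaf)):
        assert np.isclose(leaves[k].mean(), estimator.tree_.value[node, 0, 0])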
    

    To summarize:

    What you need, given data X, outcomes y, and a fitted decision tree regressor estimator, is:

    on_leaf = estimator.apply(X)
    
    leaves = []
    for i in np.unique(on_leaf):
      leaves.append(y[np.argwhere(on_leaf==i)])
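
    Equivalently (a minor variation, not part of the original recipe), a dict comprehension keyed by the actual node id avoids mapping positional list indices to node numbers; boolean indexing with on_leaf == i also returns flat 1-D arrays instead of the (n, 1) column vectors that np.argwhere produces:

    leaves_by_node = {i: y[on_leaf == i] for i in np.unique(on_leaf)}
    leaves_by_node[3]  # the raw y values of leaf node 3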