python, scikit-learn, decision-tree

How can I create a scikit-learn tree by hand?


For testing some code I want to be able to create a sklearn.tree._tree.Tree by hand, rather than by fitting to some data.

For concreteness let's say I want a tree that classifies points in the real line into intervals (-infinity, 5], (5,6] or (6,infinity). I want the tree shaped like

----0----
|        |
|     ---2---
|     |      |
1     3      4

where node 0 splits the real line at 5 and node 2 splits the real line at 6.

How can I do this? I see that trees have a __setstate__ method, and judging by the output of __getstate__ it looks like I need something like

import numpy as np

state = {
        'n_features_': 1,
        'max_depth': 2,
        'node_count': 5,
        'nodes': np.array([(1 ,   2,  0,  5., 0.375, 3, 3.),
                           (-1,  -1,  0, -2., 0.   , 1, 1.),
                           (3 ,   4,  0,  6., 0.   , 2, 2.),
                           (-1,  -1,  0, -2., 0.   , 1, 1.),
                           (-1,  -1,  0, -2., 0.   , 1, 1.),
                           ],
                          dtype=[('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'),('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]),
}

But I don't really understand what these parameters mean and in any case I don't see how to initialize a tree with this state in the first place.


Solution

  • After hours of trying to edit the nodes by hand, I found a solution. You are right: you can customize a tree through __setstate__. The 'nodes' key must be as follows:

    • numpy array of tuples
    • each tuple must be like the following: (left_child[i], right_child[i], feature[i], threshold[i], impurity[i], n_node_samples[i], weighted_n_node_samples[i])

    The values -1 (for left_child and right_child) and -2 (for feature) mark leaves. Fitted trees also store -2 as the threshold of a leaf.

    The state of a trained classifier also has another key, 'values', which holds the per-class values for each node.
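
The steps above can be put together roughly as follows. This is a minimal sketch, not an official recipe: Tree and __setstate__ are private scikit-learn internals, and the exact node record layout varies between versions, so the sketch borrows the dtype from a throwaway fitted tree instead of hard-coding it. The impurity and sample counts are illustrative placeholders, and the 'values' shape of (node_count, n_outputs, n_classes) is an assumption based on what __getstate__ returns for a fitted single-output classifier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import Tree  # private API, may change between versions

# Borrow the node dtype from a throwaway fitted tree, because the record
# layout (and any extra fields) depends on the installed scikit-learn.
stub = DecisionTreeClassifier().fit([[0.0], [1.0]], [0, 1])
node_dtype = stub.tree_.__getstate__()["nodes"].dtype

n_classes = 3
fields = ["left_child", "right_child", "feature", "threshold",
          "impurity", "n_node_samples", "weighted_n_node_samples"]
rows = [
    (1, 2, 0, 5.0, 0.667, 3, 3.0),     # node 0: x <= 5 goes to node 1, else node 2
    (-1, -1, -2, -2.0, 0.0, 1, 1.0),   # node 1: leaf
    (3, 4, 0, 6.0, 0.5, 2, 2.0),       # node 2: x <= 6 goes to node 3, else node 4
    (-1, -1, -2, -2.0, 0.0, 1, 1.0),   # node 3: leaf
    (-1, -1, -2, -2.0, 0.0, 1, 1.0),   # node 4: leaf
]

# Fill only the fields listed above; any other field in the dtype stays 0.
nodes = np.zeros(len(rows), dtype=node_dtype)
for i, row in enumerate(rows):
    for name, v in zip(fields, row):
        nodes[name][i] = v

# One row of class values per node (illustrative counts).
values = np.zeros((len(rows), 1, n_classes), dtype=np.float64)
values[0, 0] = [1, 1, 1]
values[1, 0] = [1, 0, 0]
values[2, 0] = [0, 1, 1]
values[3, 0] = [0, 1, 0]
values[4, 0] = [0, 0, 1]

# Tree(n_features, n_classes, n_outputs), then load the hand-made state.
tree = Tree(1, np.array([n_classes], dtype=np.intp), 1)
tree.__setstate__({
    "max_depth": 2,
    "node_count": len(rows),
    "nodes": nodes,
    "values": values,
})

# The low-level Tree expects a 2-D float32 array.
X = np.array([[4.0], [5.0], [5.5], [6.0], [7.0]], dtype=np.float32)
leaves = tree.apply(X)
classes = tree.predict(X).reshape(len(X), -1).argmax(axis=1)
print(leaves.tolist())   # [1, 1, 3, 3, 4]
print(classes.tolist())  # [0, 0, 1, 1, 2]
```

Here apply returns the index of the leaf each sample lands in, so the three intervals map to leaves 1, 3 and 4. Samples equal to a threshold go left, which gives the closed upper bounds (-infinity, 5] and (5, 6] from the question.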