Search code examples
pythonmachine-learningoutputrandom-forestdecision-tree

Random Forest / Decision Tree Output Probability Design: Using Positive Output Leaf Samples / Total Output Leaf Samples


I am designing a binary classifier random forest model using python and scikitlearn, in which I would like to retrieve the probability of my test set being one of the two labels. To my understanding, predict_proba(xtest) will give me the following result:

Number Of Trees Voted For Classifier / Number Of Trees

I find this too imprecise, as certain tree nodes, may have separated my (non-deterministic) samples, into fairly precise leaves (100 class a , 0 class b), and imprecise leaves (5 class a, 3 class b). I would like an implementation of 'probability' that takes the total number of samples in my n-classifiers output leaves as a dominator and the total number of the overall chosen classifier in the output leaves as the numerator (even for tree's and their output leaves that chose the class most trees didn't).

For example (Simple):

2 Trees:

Tree 1:  
   --- 5, 0 Class A (Chosen)  
10    
   --- 2, 3 Class B (Unchosen)  

Tree 2:  
   --- 3, 2 Class A (Chosen)  
10   
   --- 5, 0 Class B (Unchosen)

predict_proba results:

Number of Trees that chose Class A (2) / Number of Trees (2) = 1.0

Desired Results:

Number of Class A Samples in Output Leaves (8) / Total Number Samples in Output Leaves (10) = 0.8

Does anyone have any knowledge on how to do this, or an implementation they are using?

An idea I had was to iterate through every tree, retrieving their probabilities, and averaging them. However, this would give higher bias to output leaves that have less samples (electoral college style).

How to directly access the number of samples and their classes of the output leaf of a decision tree for a specific sample (Or even just the leaves index, and go from there)? And in the case of a random forest, summing and averaging them?

If not switch platforms/libraries entirely? Or maybe just crank up the number of classifiers (not optimal)?

Some potentially helpful documentation?:

dtc.tree_.n_node_samples
dtc.tree_[node_index].n_node_samples ?

Solution

  • I find this too imprecise, as certain tree nodes, may have separated my (non-deterministic) samples, into fairly precise leaves (100 class a , 0 class b), and imprecise leaves (5 class a, 3 class b).

    As with Decision Trees in a Bagging ensemble, the individual Decision Tree learners in a Random Forest are typically fully expanded and not pruned. This is the preferred methodology of Hastie et al. in ESL. The default settings for an sklearn.ensemble.RandomForestClassifier result in fully expanded trees. What this means is that each tree perfectly fits its training data, and the entropy in the leaf nodes will be zero (assuming that there are no noisy samples with different labels for the same sample points). Therefore with the default settings, your example simply can not occur.

    Otherwise, in the case that you constrain your individual trees by specifying an appropriate max_depth or a min_samples_leaf parameter, what you are asking for is simply a different algorithm than the Random Forest algorithm described by Breiman and Cutler and implemented by sklearn. I am not aware of any library that offers such an interface. However it would be relatively trivial to achieve what you are looking for, especially if you have a good working knowledge of the algorithm. You just need to modify the return type of the Decision Tree's predict method, and then refactor the logic for taking the majority vote in the Random Forest. Minor modifications, assuming you have working implementations of the Decision Tree and Random Forest modules.

    Or maybe just crank up the number of classifiers (not optimal)?

    I'm not sure how this would help you, in general, and in particular with what you seem to be looking to do. With more classifiers, you still will be taking a majority vote in Random Forest, only the vote will be over a larger ensemble. Furthermore, in the limit, you may be worse off: although it is rare that people have issues with a RandomForest ensemble being too big and overfitting, it can happen, and too many trees in the forest can result in a final Random Forest model with excess variance.

    If you would like to work with the sklearn implementation, then I think you should find most of the information you need to get started at Understanding the Decision Tree Structure. For example, the DecisionTreeClassifier has an attribute tree_ that points to an instance of the Tree class, an array-based representation of a binary decision tree. This member allows access to a number of low-level attributes such as max_depth, or of interest to you, tree_.value. As mentioned in the documentation:

    The tree_.value array is a 3D array of shape [n_nodes, n_classes, n_outputs] which provides the count of samples reaching a node for each class and for each output. Each node has a value array which is the number of weighted samples reaching this node for each output and class.

    I would also suggest taking a look at the decision_path method of the RandomForestClassifier and the DecisionTreeClassifier. Lastly, you should ask yourself if you really need this functionality? Do you have a statistical justification for why this modification would improve the classifiers performance? Is there a specific problem you are trying to solve with this approach?