Tags: machine-learning, scikit-learn, classification, decision-tree

Decision Tree Uniqueness sklearn


I have some questions regarding decision tree and random forest classifiers.

Question 1: Is a trained Decision Tree unique?

I believe it should be unique, since each split is chosen to maximize information gain. But if the tree is unique, why does the decision tree classifier have a random_state parameter? A unique tree would be reproducible on every fit, so random_state would seem unnecessary.

Question 2: What does a decision tree actually predict?

While reading about the random forest algorithm, I learned that it averages the probability of each class across its individual trees. But as far as I know, a decision tree predicts a class, not a probability for each class.


Solution

  • Even without checking out the code, you will see this note in the docs:

    The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.

    For splitter='best', this happens here:

    # Draw a feature at random
    f_j = rand_int(n_drawn_constants, f_i - n_found_constants,
                   random_state)
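
To see the tie-breaking from the user side, here is a minimal sketch. It assumes a toy dataset with two identical columns (my own construction, not from the docs), so a split on either feature gives exactly the same improvement. The helper name `root_feature` is made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Assumption: two identical columns, so both features tie at every split.
a = np.array([0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([a, a])
y = a.astype(int)

def root_feature(seed):
    """Fit a tree with the given seed and return the feature index used at the root."""
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    return int(tree.tree_.feature[0])

# Same seed, repeated fits: the root feature never changes.
same_seed = {root_feature(0) for _ in range(5)}

# Different seeds: the root may be feature 0 or feature 1,
# depending on which tied feature is drawn first.
many_seeds = {root_feature(seed) for seed in range(50)}

print("root feature with a fixed seed:", same_seed)
print("root features across 50 seeds:", many_seeds)
```

With a fixed seed the fit is reproducible. Across seeds I would expect both feature indices to show up, since only the order in which the tied features are visited differs, but the predictions are the same either way.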
    

    And for your other question, read this:

    ...

    Just build the tree so that the leaves contain not just a single class estimate, but also a probability estimate as well. This could be done simply by running any standard decision tree algorithm, and running a bunch of data through it and counting what portion of the time the predicted label was correct in each leaf; this is what sklearn does. These are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.

    ...
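
The leaf-counting idea in the quote can be checked directly. This is a rough sketch, assuming the iris dataset and a shallow tree (both my choices, just to get impure leaves). It recomputes the class fractions per leaf by hand, compares them with `predict_proba`, and then checks that the forest's probabilities are the mean over its trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A shallow tree, so that some leaves contain more than one class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
leaf_of = tree.apply(X)        # leaf index each training sample ends up in
proba = tree.predict_proba(X)  # per-class probabilities reported by sklearn

# Recompute the probabilities by hand: class fractions of the
# training samples that fall into each leaf.
manual = np.zeros_like(proba)
for leaf in np.unique(leaf_of):
    in_leaf = leaf_of == leaf
    counts = np.bincount(y[in_leaf], minlength=tree.n_classes_)
    manual[in_leaf] = counts / counts.sum()

tree_matches = np.allclose(proba, manual)

# predict() is just the most probable class in the leaf.
predict_is_argmax = np.array_equal(
    tree.predict(X), tree.classes_[proba.argmax(axis=1)]
)

# The forest averages the per-tree probabilities.
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
mean_of_trees = np.mean([t.predict_proba(X) for t in forest.estimators_], axis=0)
forest_matches = np.allclose(forest.predict_proba(X), mean_of_trees)

print(tree_matches, predict_is_argmax, forest_matches)
```

So a single tree does give a probability per class (the class fractions in the leaf), and `predict` simply returns the class with the largest fraction.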