Tags: machine-learning, decision-tree

Decision Tree Performance, ML


If we don't give any constraints such as max_depth or a minimum number of samples per node, can a decision tree always reach 0 training error? Or does it depend on the dataset? What about the dataset shown below?

Edit: it is possible to have a split that results in lower accuracy than the parent node, right? According to decision tree theory, it should stop splitting there, even if the end result after several more splits could be good. Am I correct?

[image: the dataset referred to in the question]


Solution

  • A decision tree will always find a split that improves accuracy/score

    For example, I've built a decision tree on data similar to yours:

    [plot: accuracy by depth]

    A decision tree can reach 100% training accuracy on any dataset in which no two samples have identical feature values but different labels.
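    To make this concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset generated with make_classification, not the data from the question) that reproduces the accuracy-by-depth behaviour:

    ```python
    # Minimal sketch (assumed scikit-learn + synthetic data, not the question's dataset):
    # fit trees of increasing depth and watch training accuracy climb to 100%
    # once no depth limit is imposed.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical stand-in for the data in the question.
    X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                               n_redundant=0, class_sep=0.5, random_state=0)

    for depth in [1, 2, 4, 8, None]:          # None = no depth constraint
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
        print(f"max_depth={depth}: training accuracy = {clf.score(X, y):.3f}")
    # With no constraint, accuracy hits 1.0 because the continuous features
    # contain no duplicate rows with conflicting labels.
    ```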

    This is one reason why decision trees tend to overfit, especially on data with many features or on categorical features with many possible values.

    Indeed, we sometimes prevent a split in a node if the improvement created by the split is not high enough. This is problematic because some relationships, like y = x1 XOR x2, cannot be expressed by trees with this limitation: neither a split on x1 nor a split on x2 improves accuracy on its own, even though the two splits together separate the classes perfectly (see the sketch below).
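    The XOR case can be checked directly. The following is a rough sketch (assuming scikit-learn; min_impurity_decrease is used here as a stand-in for an "only split if the improvement is high enough" rule):

    ```python
    # Sketch of the XOR limitation (assumed scikit-learn, synthetic binary data).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Exactly balanced XOR data: y = x1 XOR x2.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 100)
    y = X[:, 0] ^ X[:, 1]

    # "Only split if it improves enough": no single split on x1 or x2 reduces
    # impurity at the root, so this tree never gets past depth 0.
    stopped = DecisionTreeClassifier(min_impurity_decrease=1e-3).fit(X, y)
    print(stopped.get_depth(), stopped.score(X, y))   # depth 0, accuracy 0.5

    # Without that rule, the root split has zero gain, but the depth-2 splits
    # produce pure leaves and 100% training accuracy.
    full = DecisionTreeClassifier().fit(X, y)
    print(full.get_depth(), full.score(X, y))         # depth 2, accuracy 1.0
    ```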

    So commonly, a tree doesn't stop because it cannot improve the model on the training data. The reason you don't see trees with 100% accuracy is that we use techniques to reduce overfitting, such as:

    1. Tree pruning, like this relatively new example. This basically means that you build your entire tree, then go back and prune nodes that did not contribute enough to the model's performance.
    2. Using a gain ratio instead of the raw gain for the splits. Basically, this is a way to express the fact that we expect less improvement from a 50%-50% split than from a 10%-90% split.
    3. Setting hyperparameters, such as max_depth and min_samples_leaf, to prevent the tree from splitting too much (see the sketch after this list).
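    As a rough illustration of points 1 and 3 (assuming scikit-learn; the dataset, the ccp_alpha value, and the hyperparameter values are made up for the example), both pruning and split constraints typically give up a little training accuracy in exchange for less overfitting:

    ```python
    # Sketch comparing an unconstrained tree with a cost-complexity-pruned tree
    # (point 1) and a hyperparameter-constrained tree (point 3).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    trees = {
        "unconstrained": DecisionTreeClassifier(random_state=0),
        "pruned (ccp_alpha)": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
        "constrained": DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                              random_state=0),
    }
    for name, clf in trees.items():
        clf.fit(X_tr, y_tr)
        print(f"{name}: train={clf.score(X_tr, y_tr):.2f}, test={clf.score(X_te, y_te):.2f}")
    ```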