
Cost Complexity Pruning: Pruned error


Can anybody explain these two statements:

In Cost Complexity Pruning, the pruned tree error can never be less than the original tree on the training dataset.

In Cost Complexity Pruning, the pruned tree error can never be less than the original tree on the validation dataset.

The first statement is true, while the second is false.


Solution

  • This is true for any pruning strategy you choose, provided that the original tree was constructed to minimize the error on the training set.

    True: the pruned tree error can never be less than the original tree on the training dataset.

    The original tree is as specific as possible, and by replacing a subtree with a leaf node, you can only get a less specific tree. So the error on the training data can only remain the same or increase; it can never decrease.

    False: the pruned tree error can never be less than the original tree on the validation dataset.

    We assume the validation set is unknown and independent of the training dataset. So, as a general rule, you can't make any assumptions of this kind. When pruning, the error on the validation dataset can increase, stay the same, or decrease.

    We expect, however, that the error will decrease, because the pruned tree is less specific to the training data and therefore more likely to generalize to unseen data (see the sketch after this list).
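
You can observe both behaviors empirically with scikit-learn's cost complexity pruning (the `ccp_alpha` parameter). The sketch below is illustrative, not from the original post: the dataset and split are arbitrary choices, but the pattern it prints matches the two statements, since training error never drops below the full tree's while validation error may move in either direction.

```python
# Minimal sketch: training error is non-decreasing under cost complexity
# pruning, while validation error can go either way.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset and split (assumptions, not from the original post).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fully grown tree, fit to minimize training error.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas along the cost complexity pruning path;
# larger alpha means more aggressive pruning.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    train_err = 1 - pruned.score(X_train, y_train)  # never below the full tree's
    val_err = 1 - pruned.score(X_val, y_val)        # may increase, decrease, or stay
    print(f"alpha={alpha:.5f}  train_err={train_err:.3f}  val_err={val_err:.3f}")
```

Running this, `train_err` increases (or stays flat) as `alpha` grows, while `val_err` typically dips for moderate pruning before rising again once the tree becomes too simple.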