If we don't give any constraints such as max_depth or a minimum number of samples per node, can a decision tree always achieve 0 training error, or does that depend on the dataset? What about the dataset shown?
Edit: it is possible for a split to result in lower accuracy than the parent node, right? According to the theory of decision trees, splitting should stop there, even if the end result after several further splits could be good. Am I correct?
A decision tree will always find a split that improves the accuracy/score on the training data.
For example, I've built a decision tree on data similar to yours:
A decision tree can reach 100% training accuracy on any data set in which no two samples have the same feature values but different labels. This is one reason why decision trees tend to overfit, especially on data with many features or on categorical features with many possible values.
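To make this concrete, here is a minimal sketch (assuming scikit-learn and made-up data, not the dataset from the question): with no constraints, the tree keeps splitting until every leaf is pure, even when the labels are pure noise.

```python
# Sketch: an unconstrained tree memorizes any training set whose feature
# vectors are unique, reaching 100% training accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # 200 samples, 5 continuous features
y = rng.integers(0, 2, size=200)   # labels assigned completely at random

# No max_depth, min_samples_leaf, etc.: splitting continues until every leaf is pure.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.score(X, y))            # 1.0 -- perfect training accuracy
print(tree.get_depth())            # far deeper than any "real" structure warrants
```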
Indeed, we sometimes prevent a split in a node if the improvement created by the split is not large enough. This is problematic, because some relationships, like y = x_1 XOR x_2, cannot be expressed by trees built under this limitation.
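As a rough illustration (a sketch assuming scikit-learn, where min_impurity_decrease stands in for such a minimum-improvement rule): on XOR data, no single split improves purity on its own, so a positive threshold stops the tree at the root, while the unconstrained tree recovers the relationship with two levels of splits.

```python
# Sketch: XOR labels. Splitting on x_1 or x_2 alone gives zero impurity
# decrease, so a minimum-improvement threshold blocks the very first split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # y = x_1 XOR x_2

greedy = DecisionTreeClassifier(min_impurity_decrease=0.1).fit(X, y)
print(greedy.score(X, y))   # 0.5 -- root never splits, tree predicts one class

free = DecisionTreeClassifier().fit(X, y)
print(free.score(X, y))     # 1.0 -- zero-gain split is allowed, second level fixes it
```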
So usually a tree doesn't stop because it can no longer improve the model on the training data. The reason you don't see trees with 100% training accuracy is that we use techniques to reduce overfitting, such as max_depth and min_samples_leaf, to prevent the tree from splitting too much.
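For instance (again a sketch with scikit-learn on made-up, noisy data): once these constraints are in place, the tree can no longer isolate every training sample, and its training accuracy drops below 100% by design.

```python
# Sketch: the same kind of random-label data as above, now regularized.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

constrained = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(X, y)
print(constrained.score(X, y))   # noticeably below 1.0 on random labels
```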