Tags: python, scikit-learn, decision-tree

DecisionTreeClassifier does not select optimal split


I am misunderstanding something about the sklearn DecisionTreeClassifier. For the following code:

from sklearn.tree import DecisionTreeClassifier, plot_tree
from scipy.stats import entropy
import numpy as np

train_x = np.array([[-0.62144528,  0.37728335],
 [-0.46795808,  0.2464509 ],
 [-0.31221227, -0.61933418],
 [-0.37111143, -0.37863888],
 [-0.4473217,  -0.15192771],
 [-0.55939442,  0.59526016],
 [-0.37522823, -0.2779457 ],
 [-0.39228952, -0.37050653],
 [-0.43533553, -0.02755128],
 [-0.45524276,  0.39507087],
 [-0.50147608,  0.58797464],
 [-0.47677197, -0.64571978],
 [-0.41001417, -0.16771494],
 [-0.39795968, -0.27224625],
 [-0.46032929,  0.24087007],
 [-0.50722624,  0.51068014],
 [-0.44299732,  0.00477296],
 [-0.37282845, -0.68609962],
 [-0.40829113, -0.26251665],
 [-0.46950366,  0.14817891],
 [-0.58785758,  0.25280204],
 [-0.45326652,  0.0034019 ],
 [-0.41441818,  0.14027937]])

train_y = np.array([[ 7], [ 2], [12], [11], [ 4], [ 7], [10], [11], [ 1], [ 3], [ 7], [10], [ 1], [ 8], [ 3], [ 2], [ 4], [ 5], [ 8], [ 4], [ 7], [ 4], [ 1]])

clf = DecisionTreeClassifier(random_state=0, criterion="entropy", max_depth=1)
clf = clf.fit(train_x, train_y)

values_root, counts_root = np.unique(train_y, return_counts=True)
counts_root = counts_root / len(train_x)
entropy_root = entropy(counts_root, base=2)

train_left = train_y[train_x[:, 1] <= -0.65]
train_right = train_y[train_x[:, 1] > -0.65]

values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)

values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)

entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)

print('Information gain manual split: ', entropy_root - (entropy_1 + entropy_2) / 2)

train_left = train_y[train_x[:, 1] <= clf.tree_.threshold[0]]
train_right = train_y[train_x[:, 1] > clf.tree_.threshold[0]]

values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)

values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)

entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)

print('Information gain sklearn split: ', entropy_root - (entropy_1 + entropy_2) / 2)

plot_tree(clf)

The optimal split found by the DecisionTreeClassifier is y <= -0.215 (shown in the plotted tree).

This gives an information gain of 3.186 - (2.25 + 2.257) / 2 = 0.933. However, splitting y at, for example, -0.65, as in the code above, yields an IG of 3.186 - (0 + 3.061) / 2 = 1.656. Why does the classifier not find splits with a higher IG like this one? min_samples_leaf is at its default of 1, so that is not the reason. What mistake am I making?


Solution

  • The improvement is measured using the average child impurity weighted by the number of samples in each child node, not the plain average your code computes. See e.g. the equation for $G(Q_m, \theta)$ in this section of the User Guide.

    So the decision tree's impurity after its split is 2.25*8/23 + 2.257*15/23 = 2.2545, while your candidate split has 0*1/23 + 3.061*22/23 = 2.9279. Under the weighted criterion, sklearn's split really is the better one.
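
    To check this directly, here is a small sketch that recomputes the sample-weighted child entropy for both thresholds, reusing the question's data (the helper name `weighted_child_entropy` is my own, not part of sklearn):

    ```python
    import numpy as np
    from scipy.stats import entropy

    # Second feature (y) and labels from the question's training data.
    y_feat = np.array([0.37728335, 0.2464509, -0.61933418, -0.37863888,
                       -0.15192771, 0.59526016, -0.2779457, -0.37050653,
                       -0.02755128, 0.39507087, 0.58797464, -0.64571978,
                       -0.16771494, -0.27224625, 0.24087007, 0.51068014,
                       0.00477296, -0.68609962, -0.26251665, 0.14817891,
                       0.25280204, 0.0034019, 0.14027937])
    labels = np.array([7, 2, 12, 11, 4, 7, 10, 11, 1, 3, 7, 10, 1, 8,
                       3, 2, 4, 5, 8, 4, 7, 4, 1])

    def weighted_child_entropy(threshold):
        """Average child entropy, weighted by the fraction of samples
        that falls into each child -- the quantity sklearn minimizes."""
        n = len(labels)
        total = 0.0
        for mask in (y_feat <= threshold, y_feat > threshold):
            child = labels[mask]
            _, counts = np.unique(child, return_counts=True)
            total += len(child) / n * entropy(counts / len(child), base=2)
        return total

    print(weighted_child_entropy(-0.215))  # sklearn's split: ~2.254
    print(weighted_child_entropy(-0.65))   # candidate split:  ~2.928
    ```

    The candidate split at -0.65 only peels off a single sample, so its pure left child (entropy 0) carries almost no weight, and the large right child dominates the weighted average.
    
    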