I am misunderstanding something about the sklearn DecisionTreeClassifier. For the following code:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from scipy.stats import entropy
import numpy as np
train_x = np.array([[-0.62144528, 0.37728335],
[-0.46795808, 0.2464509 ],
[-0.31221227, -0.61933418],
[-0.37111143, -0.37863888],
[-0.4473217, -0.15192771],
[-0.55939442, 0.59526016],
[-0.37522823, -0.2779457 ],
[-0.39228952, -0.37050653],
[-0.43533553, -0.02755128],
[-0.45524276, 0.39507087],
[-0.50147608, 0.58797464],
[-0.47677197, -0.64571978],
[-0.41001417, -0.16771494],
[-0.39795968, -0.27224625],
[-0.46032929, 0.24087007],
[-0.50722624, 0.51068014],
[-0.44299732, 0.00477296],
[-0.37282845, -0.68609962],
[-0.40829113, -0.26251665],
[-0.46950366, 0.14817891],
[-0.58785758, 0.25280204],
[-0.45326652, 0.0034019 ],
[-0.41441818, 0.14027937]])
train_y = np.array([[ 7], [ 2], [12], [11], [ 4], [ 7], [10], [11], [ 1], [ 3], [ 7], [10], [ 1], [ 8], [ 3], [ 2], [ 4], [ 5], [ 8], [ 4], [ 7], [ 4], [ 1]])
clf = DecisionTreeClassifier(random_state=0, criterion="entropy", max_depth=1)
clf = clf.fit(train_x, train_y)
values_root, counts_root = np.unique(train_y, return_counts=True)
counts_root = counts_root / len(train_x)
entropy_root = entropy(counts_root, base=2)
train_left = train_y[train_x[:, 1] <= -0.65]
train_right = train_y[train_x[:, 1] > -0.65]
values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)
values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)
entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)
print('Information gain manual split: ', entropy_root - (entropy_1 + entropy_2) / 2)
train_left = train_y[train_x[:, 1] <= clf.tree_.threshold[0]]
train_right = train_y[train_x[:, 1] > clf.tree_.threshold[0]]
values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)
values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)
entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)
print('Information gain sklearn split: ', entropy_root - (entropy_1 + entropy_2) / 2)
plot_tree(clf)
The optimal split found by the DecisionTreeClassifier is y <= -0.215.
This gives an information gain of 3.186 - (2.25 + 2.257) / 2 = 0.933. However, splitting y at, for example, -0.65, as in the code above, leads to an IG of 3.186 - (0 + 3.061) / 2 = 1.656. Why does the classifier not find splits with a higher IG, like this one? min_samples_leaf is at its default of 1, so that is not the reason. What mistake am I making?
The improvement is measured using the average child impurity, weighted by the number of samples in each child node, not the plain average. See e.g. the equation for $G(Q_m, \theta)$ in this section of the User Guide.
So the weighted impurity after sklearn's split is 2.25 * 8/23 + 2.257 * 15/23 ≈ 2.2545,
while your candidate split gives 0 * 1/23 + 3.061 * 22/23 ≈ 2.9279, which is worse.
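To make the weighting concrete, here is a minimal sketch (using toy labels, not the question's data, and hypothetical helper names h and weighted_impurity) of how a split that isolates a single pure sample barely reduces the sample-weighted child impurity, even though its unweighted average entropy looks attractive:

```python
import numpy as np
from scipy.stats import entropy

def h(labels):
    """Shannon entropy (base 2) of a 1-D label array."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts / counts.sum(), base=2)

def weighted_impurity(left, right):
    """Child entropies averaged by each child's share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * h(left) + len(right) / n * h(right)

# A split that peels off one pure sample: the left child has zero entropy,
# but it covers only 1 of 9 samples, so it contributes almost nothing.
left = np.array([5])
right = np.array([1, 1, 2, 2, 3, 3, 4, 4])

# Unweighted average (the question's formula) rewards this split heavily:
print((h(left) + h(right)) / 2)   # small, since h(left) == 0
# The weighted criterion sklearn uses stays close to h(right):
print(weighted_impurity(left, right))
```

The same weighting applied to the question's data is what turns the seemingly strong y <= -0.65 split (one sample on the left) into the worse candidate: its pure one-sample leaf carries weight 1/23, while the nearly unchanged 22-sample leaf dominates the score.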