I've recently been working with RPART and ran into a calculation I don't understand.
When splitting on information gain, how is "improve" (or variable importance) calculated? They seem to be the same in my tests.
As a dummy example, I tried learning the following table:
happy,class
yes,p
no,n
with the command:
fit <- rpart(class ~ happy, data = train, parms = list(split = "information"), minsplit = 0)
It's simple, and returns the expected tree with the root and then each leaf containing one element.
What confuses me is that the improvement reported for the split is 1.386294.
I would expect the improvement here to be 1 (going from entropy 1 in the root to entropy 0 in the children). What am I missing?
To answer my own question: RPART uses the natural log rather than log base 2, so the entropy of the root is ln(2), about 0.693, instead of 1.
On top of that, the improve score appears to be the drop in entropy scaled by the number of elements in the node.
The scaled entropy of the root node (2 elements) is: -ln(1/2)*1/2*2 + -ln(1/2)*1/2*2 = -ln(1/2)*2 = 2*ln(2) ≈ 1.386. The entropy of both leaf nodes is 0, so the improvement is 1.386 - 0 = 1.386.
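The arithmetic above can be reproduced by hand in base R. This is only a sketch of the calculation as I understand it from the numbers, not RPART's actual code; `entropy_nats` and `scaled_improve` are hypothetical helper names I made up for illustration.

```r
# Entropy in nats (natural log); classes with zero probability are dropped.
entropy_nats <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Hypothetical helper: entropy drop for a split, with each node's entropy
# weighted by the number of observations in that node.
scaled_improve <- function(labels, split_var) {
  parent <- length(labels) * entropy_nats(labels)
  children <- sum(sapply(split(labels, split_var),
                         function(y) length(y) * entropy_nats(y)))
  parent - children
}

# The two-row toy table from the question.
train <- data.frame(happy = c("yes", "no"), class = c("p", "n"))

improve <- scaled_improve(train$class, train$happy)
print(improve)                               # 2 * ln(2), i.e. 1.386294

# Dividing by n * ln(2) converts back to bits per observation.
print(improve / (nrow(train) * log(2)))      # 1
```

Dividing by the node size and by ln(2) recovers the value of 1 that I originally expected, so the two numbers describe the same split in different units.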
Why they use natural log, I have no idea.