With R's rpart function, I can easily fit a model. For example:
# Classification Tree with rpart
library(rpart)
# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
method="class", data=kyphosis)
printcp(fit) # display the results
plotcp(fit)
summary(fit) # detailed summary of splits
# plot tree
plot(fit, uniform=TRUE,
main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
My question is: how can I measure the "importance" of each of my three explanatory variables (Age, Number, Start) to the model?
If this were a regression model, I could have looked at the p-values from the "anova" F-test (between lm models with and without the variable). But what is the equivalent, for an rpart object, of using "anova" on an lm object?
(I hope I managed to make my question clear)
Thanks.
Of course anova would be impossible, as anova involves calculating the total variation in the response variable and partitioning it into informative components (SSA, SSE). I can't see how one could calculate sum of squares for a categorical variable like Kyphosis.
I think that what you are actually talking about is attribute selection (or evaluation). I would use the information gain measure, for example. A criterion of this kind is what selects the test attribute at each node of a decision tree: the attribute with the highest information gain (the greatest entropy reduction) is chosen as the test attribute for the current node, because it minimizes the information needed to classify the samples in the resulting partitions. (Note that rpart itself defaults to the Gini index for classification trees; it uses information gain only if you pass parms = list(split = "information").)
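To make the entropy-reduction idea concrete, here is a minimal sketch that computes the information gain of a single binary split by hand. The helper names entropy and info_gain are my own (they are not part of rpart or any other package), and the cutoff of 8.5 on Start is only an illustrative value, not something the method prescribes.

```r
library(rpart)  # only needed here for the kyphosis data set

# Shannon entropy (in bits) of a categorical vector.
# An empty vector is treated as having zero entropy.
entropy <- function(x) {
  if (length(x) == 0) return(0)
  p <- as.numeric(table(x)) / length(x)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of the binary split `x < cutoff` with respect to class y:
# parent entropy minus the size-weighted entropy of the two children.
info_gain <- function(y, x, cutoff) {
  left <- x < cutoff
  w <- mean(left)               # fraction of samples sent to the left child
  entropy(y) - (w * entropy(y[left]) + (1 - w) * entropy(y[!left]))
}

# Illustrative cutoff (an assumption): gain from splitting Start at 8.5
info_gain(kyphosis$Kyphosis, kyphosis$Start, cutoff = 8.5)
```

Repeating this over candidate cutoffs and variables gives a rough per-attribute score; a dedicated estimator handles numeric attributes and ties more carefully than this sketch does.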
I don't know whether R has a method for ranking attributes by their information gain, but WEKA does, and it is named InfoGainAttributeEval. It evaluates the worth of an attribute by measuring the information gain with respect to the class. If you use Ranker as the Search Method, the attributes are ranked by their individual evaluations.
EDIT
I finally found a way to do this in R, using the CORElearn library:
library(rpart) # provides the kyphosis data set
library(CORElearn)
estInfGain <- attrEval(Kyphosis ~ ., kyphosis, estimator="InfGain")
print(estInfGain)
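As a complement, the fitted rpart object itself carries a variable.importance component, at least in the rpart versions I am aware of (4.1 and later, as far as I know), so it may be worth comparing against the information-gain ranking. The sketch below assumes that component is present; the variable name imp is my own. Keep in mind that rpart's importance is computed from the tree's own split criterion and surrogate splits, so it is not the same quantity as the InfGain estimate.

```r
library(rpart)

# Same model as in the question
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method = "class", data = kyphosis)

# Named numeric vector of importance scores (assumed present in the fit)
imp <- sort(fit$variable.importance, decreasing = TRUE)
print(imp)

# Rescale to percentages so the variables are easier to compare
print(round(100 * imp / sum(imp)))
```

I believe summary(fit) prints a similarly rescaled "Variable importance" line near the top of its output, so the same numbers can also be read from there.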