
Is there an equivalent of "anova" (for lm) for an rpart object?


With R's rpart function, I can easily fit a model. For example:

# Classification Tree with rpart
library(rpart)

# grow tree 
fit <- rpart(Kyphosis ~ Age + Number + Start,
     method="class", data=kyphosis)

printcp(fit) # display the results 
plotcp(fit) 
summary(fit) # detailed summary of splits

# plot tree 
plot(fit, uniform=TRUE, 
     main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

My question is: how can I measure the "importance" of each of my three explanatory variables (Age, Number, Start) to the model?

If this were a regression model, I could look at the p-values from the "anova" F-test (comparing lm models with and without the variable). What is the equivalent of using "anova" on lm for an rpart object?

(I hope I managed to make my question clear)

Thanks.


Solution

  • A classical anova does not carry over directly, because anova works by calculating the total variation in the response variable and partitioning it into informative components (SSA, SSE). I can't see how one would calculate sums of squares for a categorical response like Kyphosis.

    I think that what you are actually talking about is attribute selection (or evaluation). I would use the information gain measure, for example. I think this is what is used to select the test attribute at each node in the tree: the attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions.

    I am not aware of a method for ranking attributes according to their information gain in R, but I know there is one in WEKA, named InfoGainAttributeEval. It evaluates the worth of an attribute by measuring the information gain with respect to the class. If you use Ranker as the search method, the attributes are ranked by their individual evaluations.

    EDIT: I finally found a way to do this in R using the CORElearn package:

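    library(CORElearn)
    data(kyphosis, package = "rpart")  # the kyphosis data ships with rpart
    estInfGain <- attrEval(Kyphosis ~ ., kyphosis, estimator="InfGain")  # information gain per predictor
    print(estInfGain)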
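
A related sketch that stays within rpart itself, in case it helps. It assumes a reasonably recent rpart version, in which a fitted tree that has splits carries a variable.importance component. Note that rpart's default splitting criterion for classification is the Gini index; passing parms = list(split = "information") switches it to the entropy-based criterion discussed above. The names fit_info, imp and imp_pct are just illustrative.

```r
library(rpart)

# Same model as in the question, but splitting on information (entropy)
# instead of rpart's default Gini index.
fit_info <- rpart(Kyphosis ~ Age + Number + Start,
                  method = "class", data = kyphosis,
                  parms = list(split = "information"))

# Named numeric vector: total improvement credited to each variable,
# counting its primary splits plus its (down-weighted) surrogate splits.
imp <- fit_info$variable.importance

# Rescale to percentages so the predictors are easier to compare.
imp_pct <- round(100 * imp / sum(imp), 1)
print(imp_pct)
```

This is a descriptive ranking rather than a significance test, so it is not a direct substitute for the p-values that anova gives for lm. Because surrogate splits are counted, a variable can receive some importance even if it never appears as a primary split in the plotted tree.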