Tags: r, decision-tree, rpart, party

How to filter independent variables in decision-tree in R with rpart or party package


I am a SAS user currently learning how to build decision trees with R packages.

I have some good findings associated with each node, but I am now facing three questions:

  1. Can I force the tree to start with a specific variable at the top, say a categorical variable like gender? (I did this in FICO Model Builder, but I no longer have access to it.)

  2. I have a binary variable (gender: 1 = Male, 0 = Female), but the nodes split at 0.5. I tried changing it to a factor, but that didn't work. I also have a variable "AGE"; should I change its type to "xxx" instead of "numeric"?

  3. Based on the cp values (table below), I set cp = 0.0128 to prune the tree, but only two variables are left. Can I choose to keep specific variables? (I have tried different values of cp, but the result does not change.)

[Three images, including the cp table, are not reproduced here.]

#tree
library(rpart)
library(party)
library(rpart.plot)

# 1
minsplit <- 60
ct <- rpart.control(xval = 10, minsplit = minsplit,
                    minbucket = minsplit / 3, cp = 0.01)

iris_tree <- rpart(Overday_E60dlq ~ .,
                   data = x, method = "class",
                   parms = list(prior = c(0.65, 0.35), split = "information"),
                   control = ct)



# plot the splits
plot_tris <- rpart.plot(iris_tree, branch = 1, branch.type = 1, type = 2,
                        extra = 103,
                        shadow.col = "gray", box.col = "green",
                        border.col = "blue", split.col = "red",
                        cex = 0.65, main = "Kyphosis-tree")

plot_tris
# summary
summary(iris_tree)



#=========== prune process =========
printcp(iris_tree)

## prune at the cp with the minimum xerror
fitcp <- prune(iris_tree,
               cp = iris_tree$cptable[which.min(iris_tree$cptable[, "xerror"]), "CP"])

# prune again with a cp taken from the cp table
fit2 <- prune(fitcp, cp = 0.0128)

# plot fit2
rpart.plot(fit2, branch = 1, branch.type = 1, type = 2, extra = 103,
           shadow.col = "gray", box.col = "green",
           border.col = "blue", split.col = "red",
           cex = 0.65, main = "Kyphosis fit2")

Solution

    1. I don't think any of the more popular tree packages in R has a built-in option for specifying fixed initial splits. The partykit package (successor to the party package), however, has infrastructure that can be leveraged to put together such trees with a little bit of effort; see: How to specify split in a decision tree in R programming?
    2. You should use factor variables for unordered categorical covariates (like gender), ordered factors for ordinal covariates, and numeric or integer for numeric covariates (so AGE should stay numeric). Note that this may matter not only in the visual display but also in the recursive partitioning itself. With an exhaustive search algorithm like rpart/CART the coding of a binary variable is not relevant for the fitted splits, but for unbiased inference-based algorithms like ctree or mob this may be an important difference.
    3. Cost-complexity pruning does not allow you to keep specific covariates. The cp value is a measure for the overall tree, not for individual variables.
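
For point 1, a simple workaround that needs only rpart is to partition the data by hand on the chosen variable and grow one ordinary tree inside each branch. This is not the partykit approach from the linked answer, just a rough sketch of the idea. The data are simulated, and the column names gender, AGE, income, y and the helper predict_forced are hypothetical.

```r
library(rpart)

# Simulated data; all column names are illustrative only.
set.seed(1)
n <- 400
d <- data.frame(
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  AGE    = round(runif(n, 20, 70)),
  income = round(rnorm(n, 50, 10))
)
d$y <- factor(ifelse((d$gender == "Male" & d$AGE > 45) |
                       (d$gender == "Female" & d$income < 45),
                     "bad", "good"))

# "Force" gender as the first split by partitioning the data by hand,
# then grow one ordinary rpart tree inside each branch.
branches <- split(d, d$gender)
subtrees <- lapply(branches, function(part) {
  rpart(y ~ AGE + income, data = part, method = "class")
})

# Hypothetical helper: route each row to the subtree for its gender.
predict_forced <- function(newdata) {
  out <- rep(NA_character_, nrow(newdata))
  for (g in names(subtrees)) {
    idx <- newdata$gender == g
    if (any(idx)) {
      out[idx] <- as.character(
        predict(subtrees[[g]], newdata[idx, , drop = FALSE], type = "class")
      )
    }
  }
  factor(out, levels = levels(d$y))
}

pred <- predict_forced(d)
table(predicted = pred, observed = d$y)
```

The price of this shortcut is that the result is a list of trees, not a single tree object, so it cannot be plotted or pruned as one unit. If that matters, the partykit infrastructure mentioned above is the better route.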
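
To illustrate point 2, the following sketch fits the same rpart model twice, once with gender stored as numeric 0/1 and once as a factor. The data are simulated and the column names gender, AGE and y are assumptions made for the example. With the numeric coding the split is reported as a cut point at 0.5; with the factor coding it is reported by level name.

```r
library(rpart)

# Simulated data; the column names are illustrative only.
set.seed(1)
n <- 500
d <- data.frame(
  gender = rbinom(n, 1, 0.5),       # 1 = Male, 0 = Female, stored as numeric
  AGE    = round(runif(n, 20, 70))  # a genuine numeric covariate stays numeric
)
d$y <- factor(ifelse(d$gender == 1 & d$AGE > 40, "bad", "good"))

# Numeric 0/1 coding: the gender split is labelled with the cut point 0.5.
fit_num <- rpart(y ~ gender + AGE, data = d, method = "class")
print(labels(fit_num))

# Factor coding: the gender split is labelled with the level names.
d_fac <- transform(d, gender = factor(gender, levels = c(0, 1),
                                      labels = c("Female", "Male")))
fit_fac <- rpart(y ~ gender + AGE, data = d_fac, method = "class")
print(labels(fit_fac, minlength = 0))  # minlength = 0 keeps full level names
```

Note that the conversion has to happen in the data frame that is passed to rpart, before the model is fitted. Converting a copy of the column, or converting after fitting, leaves the 0.5 cut point in place.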
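
For point 3, two properties of prune() may explain why changing cp appears to have no effect. First, every cp value that falls between two consecutive rows of the cp table selects the same subtree. Second, pruning a tree that has already been pruned (as fit2 is derived from fitcp in the question) can only shrink it further, never regrow it. The sketch below shows both on the kyphosis data that ships with rpart, used here as a stand-in for the asker's data; the helper used_vars is hypothetical. It also shows that the way to control which variables can appear is to restrict the model formula and refit.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the asker's data here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)

# Hypothetical helper: names of the variables actually used in splits.
used_vars <- function(tree) {
  setdiff(unique(as.character(tree$frame$var)), "<leaf>")
}

# Two different cp values that both lie between rows 1 and 2 of the cp table.
cps  <- unname(fit$cptable[, "CP"])
cp_a <- cps[2] + 0.25 * (cps[1] - cps[2])
cp_b <- cps[2] + 0.75 * (cps[1] - cps[2])
p_a  <- prune(fit, cp = cp_a)
p_b  <- prune(fit, cp = cp_b)
print(identical(rownames(p_a$frame), rownames(p_b$frame)))  # same set of nodes

# Pruning an already pruned tree with a smaller cp cannot regrow it.
p_again <- prune(p_a, cp = min(cps))
print(identical(rownames(p_again$frame), rownames(p_a$frame)))

# To control which variables can appear, restrict the formula and refit.
fit_keep <- rpart(Kyphosis ~ Age + Start, data = kyphosis, method = "class")
print(used_vars(fit_keep))
```

Restricting the formula only limits which variables are available; it does not guarantee that each of them ends up in a split. To compare several cp values, prune the full tree each time instead of chaining prune() calls.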