I am a SAS user currently learning how to build decision trees with R packages.
I have some good findings associated with each node, but I now have three questions:
1. Can I force the tree to start with a specific variable at the top, say a categorical variable like gender? (I did this in FICO Model Builder, but I no longer have access to it.)
2. I have a binary variable (gender: 1 = Male, 0 = Female), but the node splits at 0.5. I tried changing it to a factor, but that didn't work. Also, I have a variable "AGE": should I change its type to something else ("xxx") instead of "numeric"?
3. Based on the cp values (see the cp table below), I set cp = 0.0128 to prune the tree, but only two variables are left. Can I choose to keep specific variables? (I tried different cp values, but the result does not change.)
# tree
library(rpart)
library(party)
library(rpart.plot)

# 1: fit the classification tree
minsplit <- 60
ct <- rpart.control(xval = 10, minsplit = minsplit,
                    minbucket = minsplit / 3, cp = 0.01)
iris_tree <- rpart(Overday_E60dlq ~ .,
                   data = x, method = "class",
                   parms = list(prior = c(0.65, 0.35), split = "information"),
                   control = ct)

# plot split
plot_tris <- rpart.plot(iris_tree, branch = 1, branch.type = 1, type = 2, extra = 103,
                        shadow.col = "gray", box.col = "green",
                        border.col = "blue", split.col = "red",
                        cex = 0.65, main = "Kyphosis-tree")
plot_tris

# summary
summary(iris_tree)

# =========== prune process =========
printcp(iris_tree)

## prune at the cp with the minimum cross-validated error (xerror)
fitcp <- prune(iris_tree,
               cp = iris_tree$cptable[which.min(iris_tree$cptable[, "xerror"]), "CP"])

# prune again at a cp chosen from the cp table
fit2 <- prune(fitcp, cp = 0.0128)

# plot fit2
rpart.plot(fit2, branch = 1, branch.type = 1, type = 2, extra = 103,
           shadow.col = "gray", box.col = "green",
           border.col = "blue", split.col = "red",
           cex = 0.65, main = "Kyphosis fit2")
The partykit package (successor to the party package), however, has infrastructure that can be leveraged to put together such trees with a little bit of effort; see: How to specify split in a decision tree in R programming?

Use factor variables for unordered categorical covariates (like gender), ordered factors for ordinal covariates, and numeric or integer for numeric covariates. Note that this may matter not only in the visual display but also in the recursive partitioning itself. With an exhaustive search algorithm like rpart/CART it is not relevant, but for unbiased inference-based algorithms like ctree or mob it may be an important difference.
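
As a minimal sketch of the type advice above, the following fits the same rpart tree twice, once with gender stored as numeric 0/1 and once as a factor. The data are simulated: the data frames d_num and d_fac and the columns gender, AGE and y are hypothetical stand-ins for your x, and the outcome is deliberately made to depend on gender so that it is picked for the first split.

```r
library(rpart)

set.seed(1)
n <- 400

# Hypothetical data standing in for the asker's `x`
d_num <- data.frame(
  gender = rbinom(n, 1, 0.5),       # 1 = Male, 0 = Female, stored as numeric
  AGE    = round(runif(n, 20, 70))  # a genuinely numeric covariate stays numeric
)
# Simulated outcome that depends strongly on gender (an assumption of this sketch)
p_bad   <- ifelse(d_num$gender == 1, 0.8, 0.2)
d_num$y <- factor(ifelse(runif(n) < p_bad, "bad", "good"))

# Same data, but gender recoded as an unordered factor with readable labels
d_fac <- d_num
d_fac$gender <- factor(d_fac$gender, levels = c(0, 1),
                       labels = c("Female", "Male"))

fit_num <- rpart(y ~ gender + AGE, data = d_num, method = "class")
fit_fac <- rpart(y ~ gender + AGE, data = d_fac, method = "class")

print(fit_num)  # gender split is reported as a numeric cut point
print(fit_fac)  # gender split is reported in terms of the factor levels
```

The conversion has to happen in the data frame before rpart() is called; converting a copy that is not the one passed as data would leave the display unchanged, which may be why the factor attempt appeared not to work. AGE is left as numeric here, in line with the advice for numeric covariates.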