I am running rpart twice on the same data set. The only difference is the order of the columns, yet I am getting different results.
This is my dataset
Home.Owner Marital.Status Annual.Income Default
1 Yes Single 125 No
2 No Married 100 No
3 No Single 70 No
4 Yes Married 120 No
5 No Divorced 95 Yes
6 No Married 60 No
7 Yes Divorced 220 No
8 No Single 85 Yes
9 No Married 75 No
10 No Single 90 Yes
This is the code
a<-read.csv("ab.csv")
library(rpart)
library(rpart.plot)
model1 <- rpart(Default ~ ., data = a, method = "class",
                minsplit = 1, minbucket = 1,
                parms = list(split = "information"))
rpart.plot(model1)
Results: [tree plot for model1]
# changing the column order
b <- a[, c(4, 3, 2, 1)]
# running the same process
model2 <- rpart(Default ~ ., data = b, method = "class",
                minsplit = 1, minbucket = 1,
                parms = list(split = "information"))
rpart.plot(model2)
The only thing that has changed is the order of the columns.
Nothing is wrong here. This happens and I can explain why.
Notice that the two trees differ from the very first split. That is what we must understand. rpart uses an impurity measure to decide which variable to split on: Gini impurity by default, or information (entropy) in your call, because you passed parms = list(split = "information"). One time it split on Marital.Status and the other time on Annual.Income. Look carefully at what happened in each case. When it split on Marital.Status, it created two nodes: one with 40% of the data and no errors, the other with 60% of the data and 50% errors. When it split on Annual.Income, it produced exactly the same distribution: one node with 40% of the data and no errors, the other with 60% of the data and 50% errors. Because the class distributions in the child nodes are identical, the two splits give the same impurity reduction, whether you measure it with Gini or with information. It is a tie between the two attributes, so rpart makes an arbitrary choice between them. It picks the first one it encounters, hence the dependence on the column order.
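You can check the tie by hand. Here is a minimal sketch in base R that does not call rpart at all; it just computes the weighted child impurity of the two candidate first splits on the data from the question. The helper names (entropy, gini, split_impurity) are my own, and the income cut point of 97.5 is an assumption (the midpoint between 95 and 100, which is the cut that isolates 40% of the rows), not something read off the plots.

```r
# Toy data copied from the question (Home.Owner omitted, it is not needed here)
d <- data.frame(
  Marital.Status = c("Single", "Married", "Single", "Married", "Divorced",
                     "Married", "Divorced", "Single", "Married", "Single"),
  Annual.Income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  Default        = c("No", "No", "No", "No", "Yes",
                     "No", "No", "Yes", "No", "Yes")
)

# Impurity of a vector of class labels (hypothetical helpers)
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]
  -sum(p * log2(p))
}
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Weighted impurity of the two children produced by a logical split
split_impurity <- function(y, left, f) {
  n <- length(y)
  sum(left) / n * f(y[left]) + sum(!left) / n * f(y[!left])
}

# The two candidate first splits
by_status <- d$Marital.Status == "Married"
by_income <- d$Annual.Income >= 97.5   # assumed cut point, see above

res <- sapply(list(entropy = entropy, gini = gini), function(f) c(
  Marital.Status = split_impurity(d$Default, by_status, f),
  Annual.Income  = split_impurity(d$Default, by_income, f)
))
print(res)
```

Both rows of res come out identical (0.6 for entropy, 0.3 for Gini), so neither criterion can separate the two variables. The two splits put different rows in each child, but the class counts in the children are the same, and that is all the impurity measure sees.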