
Inconsistent results from the rpart package


I am running rpart twice on the same data set. The only difference is the order of the columns, yet I get different results.

This is my dataset

   Home.Owner Marital.Status Annual.Income Default
1         Yes         Single           125      No
2          No        Married           100      No
3          No         Single            70      No
4         Yes        Married           120      No
5          No       Divorced            95     Yes
6          No        Married            60      No
7         Yes       Divorced           220      No
8          No         Single            85     Yes
9          No        Married            75      No
10         No         Single            90     Yes

This is the code

a <- read.csv("ab.csv")
library(rpart)
library(rpart.plot)
model1 <- rpart(Default ~ ., data = a, method = "class",
                minsplit = 1, minbucket = 1,
                parms = list(split = "information"))

rpart.plot(model1)

results:

[Image: tree plotted by rpart.plot(model1)]

# changing column order
b <- a[, c(4, 3, 2, 1)]

# running same process
model2 <- rpart(Default ~ ., data = b, method = "class",
                minsplit = 1, minbucket = 1,
                parms = list(split = "information"))

rpart.plot(model2)

[Image: tree plotted by rpart.plot(model2)]

The only thing that has changed is the order of the columns.


Solution

  • Nothing is wrong here. This happens and I can explain why.

    Notice that the two trees differ from the very first split, and that is what needs explaining. rpart chooses the variable to split on by comparing the impurity of the resulting child nodes. By default the measure is Gini impurity; your code sets split = "information", so here it is entropy, but the argument is the same for both. One tree split first on Marital Status and the other on Annual Income. Look carefully at what each split produced. The split on Marital Status created one node with 40% of the data and no errors, and another node with 60% of the data and 50% errors. The split on Annual Income produced exactly the same distribution: one node with 40% of the data and no errors, and another with 60% of the data and 50% errors. Because the child nodes have identical class distributions, the two splits have identical impurity under either measure. It is a tie between the two attributes, so rpart makes an arbitrary choice between them: it picks the first one, hence the dependence on the column order.
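
The tie can be checked by hand. The following is a minimal sketch in base R, not rpart's internal code: it rebuilds the data from the table in the question and computes the size-weighted impurity of the child nodes for both candidate first splits. The helper names (`entropy`, `gini`, `split_impurity`) are made up for illustration, and the income cutoff of 97.5 is an assumption (the midpoint between 95 and 100, which separates the four highest incomes from the rest).

```r
# Data copied from the table in the question.
a <- data.frame(
  Home.Owner     = c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"),
  Marital.Status = c("Single", "Married", "Single", "Married", "Divorced",
                     "Married", "Divorced", "Single", "Married", "Single"),
  Annual.Income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  Default        = c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"),
  stringsAsFactors = TRUE
)

# Entropy (in bits) of a vector of class labels; empty classes are skipped.
entropy <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gini impurity of a vector of class labels.
gini <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  1 - sum(p^2)
}

# Size-weighted impurity of the two child nodes defined by a logical vector.
split_impurity <- function(y, left, impurity) {
  n <- length(y)
  sum(left) / n * impurity(y[left]) + sum(!left) / n * impurity(y[!left])
}

# Candidate first splits: Married vs. the rest, and income above an
# assumed cutoff of 97.5 vs. below it.
married  <- a$Marital.Status == "Married"
high_inc <- a$Annual.Income >= 97.5

ent_marital  <- split_impurity(a$Default, married,  entropy)
ent_income   <- split_impurity(a$Default, high_inc, entropy)
gini_marital <- split_impurity(a$Default, married,  gini)
gini_income  <- split_impurity(a$Default, high_inc, gini)

print(c(entropy_marital = ent_marital, entropy_income = ent_income))
print(c(gini_marital = gini_marital, gini_income = gini_income))
```

Both splits give a weighted entropy of 0.6 bits and a weighted Gini impurity of 0.3, so neither criterion can separate them. To see the same thing in the fitted model, `summary(model1)` lists the primary splits at each node together with their improvement values.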