r machine-learning classification decision-tree rpart

rpart() decision tree fails to generate splits (decision tree with only one node (the root node))

I'm trying to create a decision tree to predict whether a given loan applicant would default or repay their debt.

I'm using the following dataset:

library(readr)
library(dplyr)
library(rpart)
library(rpart.plot)

loans <- read_csv('https://assets.datacamp.com/production/repositories/718/datasets/7805fceacfb205470c0e8800d4ffc37c6944b30c/loans.csv')

Since the response variable default is encoded as dbl, I convert it to chr and then to fct (a factor) so that I can use it in my classification model.

loans <- loans %>% mutate(default = factor(as.character(default), levels = c(0, 1), labels = c('repaid', 'defaulted')))

Next, I build the recursive partitioning (rpart()) model, loans_model. The response variable is default and the explanatory variables are loan_amount, credit_score, and debt_to_income.

loans_model <- rpart(default ~ loan_amount + credit_score + debt_to_income, data = loans, method = 'class')

When I make predictions with this model, every prediction has the same value, repaid.

loans$pred_default <- predict(loans_model, newdata = loans, type = "class")

unique(loans$pred_default)

Output:

[1] repaid
Levels: repaid defaulted

Also when I try to visualize the decision tree, I get only one node (the root).

rpart.plot(loans_model)

Why does the model I built not make appropriate predictions?

Solution

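You need to adjust the cp argument (the complexity parameter), which sets the minimum improvement a split must achieve: any split that does not decrease the overall lack of fit by a factor of cp is not attempted. The default is 0.01. With a negative value such as -1, that threshold never rejects a split, so the tree keeps growing until another limit (such as maxdepth) stops it. If you set cp to -1 and the maxdepth argument to 3, you get something more interesting, at least for a start.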

loans_model <- rpart(default ~ loan_amount + credit_score + debt_to_income, 
                     data = loans, 
                     method = 'class',
                     cp=-1,
                     maxdepth = 3)

rpart.plot(loans_model, cex=0.7)
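
To see how much the cp threshold matters for your data, one option is to refit the same formula at a few cp values and count the splits that survive. The sketch below is only illustrative: splits_by_cp is a hypothetical helper name (not part of rpart), and the cp values in the usage comment are arbitrary choices, not recommendations.

```r
library(rpart)

# Hypothetical helper (not an rpart function): fit the same classification
# formula at several cp values and report how many splits each tree has.
splits_by_cp <- function(formula, data, cp_values, maxdepth = 3) {
  n_splits <- sapply(cp_values, function(cp) {
    fit <- rpart(formula, data = data, method = "class",
                 control = rpart.control(cp = cp, maxdepth = maxdepth))
    # Terminal nodes are labelled "<leaf>" in fit$frame$var;
    # every other row of the frame corresponds to a split.
    sum(fit$frame$var != "<leaf>")
  })
  data.frame(cp = cp_values, n_splits = n_splits)
}

# Usage with the data from the question (assumes `loans` is already loaded):
# splits_by_cp(default ~ loan_amount + credit_score + debt_to_income,
#              loans, cp_values = c(0.01, 0.001, 0, -1))
```

A row with n_splits equal to 0 corresponds to the root-only tree described in the question.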

From page 21 of longintro.pdf (the introductory vignette that ships with rpart): "The default value (for cp) of .01 has been reasonably successful at ‘pre-pruning’ trees so that the cross-validation step need only remove 1 or 2 layers, but it sometimes over prunes, particularly for large data sets."
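
Because a tree grown with cp = -1 is deliberately overgrown, a common follow-up is to prune it back using the cross-validated error that rpart stores in the model's cptable. Here is a minimal sketch, assuming the default xval setting so that an xerror column exists. Note that prune_to_best_cp is a hypothetical helper name, not an rpart function.

```r
library(rpart)

# Hypothetical helper (not an rpart function): prune a fitted tree back to
# the cp value whose cross-validated error (xerror) is lowest.
prune_to_best_cp <- function(model) {
  cp_table <- model$cptable
  # If no cross-validation column is present (e.g. the model was fit with
  # xval = 0), return the model unchanged.
  if (!"xerror" %in% colnames(cp_table)) {
    return(model)
  }
  best_row <- which.min(cp_table[, "xerror"])
  prune(model, cp = cp_table[best_row, "CP"])
}

# Usage with the model from this answer (assumes `loans_model` was fit with cp = -1):
# printcp(loans_model)   # shows CP, nsplit, rel error, xerror, xstd
# pruned_model <- prune_to_best_cp(loans_model)
# rpart.plot(pruned_model, cex = 0.7)
```

If the lowest xerror falls on the row with nsplit = 0, pruning returns a root-only tree again. That would suggest that none of the splits improve cross-validated misclassification on this data.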