
How to divide train and test datasets into ratios in R for a decision tree?


Here are the instructions from my homework. I have two separate code implementations, but I'm having trouble interpreting what the instructions are asking:

Use the following ratios to divide the data set;

a) 50 train: 50 test

b) 75 train: 25 test

c) 25 train: 75 test

d) 85 train: 15 test

The SplitRatio parameter in sample.split is confusing me. I've checked the documentation, but it's not clear to me what it does; it looks to me like a probability of success that decides TRUE or FALSE for the decision tree. Question: do I set SplitRatio to 0.5 to get a 50 train : 50 test ratio, or do I modify the dataset itself to a random sample of 50 rows, 75 rows, 25 rows, etc.? Here I have SplitRatio set to 0.9, and the dataset is reduced to only 50 entries. If I change SplitRatio to 0.5, the decision tree changes dramatically; the same happens if I use the entire dataset instead of 50 rows.

#---------------------------------
#    Ratio 50 Train : 50 Test
#---------------------------------

set.seed(1)
set50 <- sample(nrow(cancerdata), 50, replace=FALSE)
#set50

cancerset5050 <- cancerdata[set50,]
cancerset5050

?sample.split

spl = sample.split(cancerset5050$study.Diagnosis, SplitRatio = 0.9)
spl

dataTrain = subset(cancerset5050, spl==TRUE)
dataTest = subset(cancerset5050, spl==FALSE)

m5050 <- J48(as.factor(study.Diagnosis)~., dataTrain) 

summary(m5050)

## visualize the model
## use partykit package
if(require("partykit", quietly = TRUE)) plot(m5050)

dataTest.pred <- predict(m5050, newdata = dataTest)
table(dataTest$study.Diagnosis, dataTest.pred)

Solution

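  • I think your understanding of the sample.split function is correct: SplitRatio is the fraction of rows assigned to the training set. If you set SplitRatio = 0.5, sample.split marks 50% of the samples as TRUE (the training set) and the remaining 50% as FALSE (the test set), which matches the way your code uses spl. It also keeps the class proportions of study.Diagnosis roughly the same in both sets.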

    I think you should convert your response variable to a factor before you split the data into training and test sets.

    That is

    cancerset5050$study.Diagnosis <- as.factor(cancerset5050$study.Diagnosis)
    

    Then move on to the training and test sets:

    dataTrain = subset(cancerset5050, spl==TRUE)
    dataTest = subset(cancerset5050, spl==FALSE)
    
    m5050 <- J48(study.Diagnosis ~., dataTrain)
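To illustrate how SplitRatio maps onto the four homework ratios, here is a minimal sketch. It assumes the ratios are meant as percentages of the whole data set, so the full data is split rather than a 50-row sample. The built-in iris data and its Species column are stand-ins for cancerdata and study.Diagnosis, which are not available here, and the names ratios, splits, and sizes are made up for this example.

```r
library(caTools)  # provides sample.split()

# Stand-in data (assumption): iris replaces the question's cancerdata.
dat <- iris
dat$Species <- as.factor(dat$Species)  # response as a factor before splitting

# Training fractions for the homework ratios (a) to (d).
ratios <- c("50:50" = 0.50, "75:25" = 0.75, "25:75" = 0.25, "85:15" = 0.85)

set.seed(1)  # make the random splits reproducible

# One train/test pair per ratio; TRUE in spl marks a training row.
splits <- lapply(ratios, function(r) {
  spl <- sample.split(dat$Species, SplitRatio = r)
  list(train = dat[spl, ], test = dat[!spl, ])
})

# Row counts per split, one column per ratio.
sizes <- sapply(splits, function(s) c(train = nrow(s$train), test = nrow(s$test)))
print(sizes)
```

Each training set could then be passed to J48 as in the answer above, for example J48(Species ~ ., splits[["50:50"]]$train), provided RWeka is installed. Because sample.split rounds the count within each class, the resulting sizes may differ from the nominal ratio by a row or two.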