Here are the instructions from the homework. I have two separate code implementations, but I'm having trouble interpreting what the instructions are asking:
Use the following ratios to divide the data set:
a) 50 train: 50 test
b) 75 train: 25 test
c) 25 train: 75 test
d) 85 train: 15 test
The SplitRatio parameter of sample.split is confusing me. I've checked the documentation, but it's still not clear to me what it does; it looks to me like a probability of success that decides TRUE or FALSE for the decision tree.
Question: do I set SplitRatio to 0.5 to get a 50 train : 50 test ratio, or do I subset the dataset itself to a random sample of 50 rows, 75 rows, 25 rows, and so on?
Right now I have SplitRatio set to 0.9 and the dataset reduced to only 50 entries. If I change SplitRatio to 0.5, the decision tree changes dramatically, and the same thing happens if I use the entire dataset instead of the 50 rows.
#---------------------------------
# Ratio 50 Train : 50 Test
#---------------------------------
library(caTools)  # provides sample.split
library(RWeka)    # provides J48
set.seed(1)
set50 <- sample(nrow(cancerdata), 50, replace=FALSE)
#set50
cancerset5050 <- cancerdata[set50,]
cancerset5050
?sample.split
spl = sample.split(cancerset5050$study.Diagnosis, SplitRatio = 0.9)
spl
dataTrain = subset(cancerset5050, spl==TRUE)
dataTest = subset(cancerset5050, spl==FALSE)
m5050 <- J48(as.factor(study.Diagnosis)~., dataTrain)
summary(m5050)
## visualize the model
## (uses the partykit package)
if(require("partykit", quietly = TRUE)) plot(m5050)
dataTest.pred <- predict(m5050, newdata = dataTest)
table(dataTest$study.Diagnosis, dataTest.pred)
I think your understanding of the sample.split function is correct. If you set SplitRatio = 0.5, then 50% of the samples go into the training set and the remaining 50% into the test set, using the same subset calls you already have.
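To make the effect of SplitRatio concrete, here is a minimal sketch that applies the four homework ratios to a made-up label vector. The `diagnosis` vector (60 "B" and 40 "M" labels) is purely hypothetical and only stands in for `cancerdata$study.Diagnosis`; the point is that `sample.split` returns a logical vector in which TRUE marks the training rows, and that it draws the requested fraction within each class.

```r
library(caTools)  # provides sample.split

set.seed(1)

# Hypothetical labels standing in for cancerdata$study.Diagnosis
diagnosis <- factor(rep(c("B", "M"), times = c(60, 40)))

# The four train fractions from the homework: 50:50, 75:25, 25:75, 85:15
ratios <- c(0.50, 0.75, 0.25, 0.85)

# Count how many rows end up in the training set for each ratio
train_counts <- sapply(ratios, function(r) {
  spl <- sample.split(diagnosis, SplitRatio = r)  # TRUE = training row
  sum(spl)
})
names(train_counts) <- ratios
print(train_counts)
```

With 100 labels, the training counts should come out as 50, 75, 25 and 85, which suggests that SplitRatio alone controls the train/test proportion and that there is no need to cut the dataset down to a fixed number of rows first.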
I think you should convert your response variable to a factor before you separate the training and test sets.
That is:
cancerset5050$study.Diagnosis <- as.factor(cancerset5050$study.Diagnosis)
And then move on to training and testing:
dataTrain = subset(cancerset5050, spl==TRUE)
dataTest = subset(cancerset5050, spl==FALSE)
m5050 <- J48(study.Diagnosis ~., dataTrain)
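Putting the pieces together, one possible way to cover all four ratios is to loop over them on the full dataset. This is only a sketch under assumptions: the built-in `iris` data stands in for `cancerdata`, `Species` stands in for `study.Diagnosis`, and the `results` list and its field names are made up for illustration. It also assumes RWeka can find a working Java installation.

```r
library(caTools)  # sample.split
library(RWeka)    # J48 (requires Java)

set.seed(1)
ratios <- c(0.50, 0.75, 0.25, 0.85)

# iris is a stand-in for cancerdata; Species is already a factor
results <- lapply(ratios, function(r) {
  spl   <- sample.split(iris$Species, SplitRatio = r)
  train <- iris[spl, ]    # rows flagged TRUE go to training
  test  <- iris[!spl, ]   # the rest go to testing
  model <- J48(Species ~ ., data = train)
  pred  <- predict(model, newdata = test)
  list(ratio    = r,
       n_train  = nrow(train),
       n_test   = nrow(test),
       accuracy = mean(as.character(pred) == as.character(test$Species)))
})

# One row per ratio: split sizes and test-set accuracy
print(do.call(rbind, lapply(results, as.data.frame)))
```

Because each ratio trains on a different subset of rows, it is expected that the resulting trees differ from one ratio to the next.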