I have a data set which I intend to split between a training set and testing set for a machine learning analysis using R.
Assuming my data set (called MyDataset) has a ratio of Yes (60%) and No (40%) based on the target variable (called Leaver), how can I ensure that my split will maintain that ratio in both the training set and the testing set?
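What you want is a stratified split of your dataset. You can do this with the createDataPartition function from the caret package. Just make sure your Leaver variable is a factor, so that the sampling is done within each class (a numeric outcome is instead grouped by percentiles before sampling).
See the code example below, which uses the GermanCredit data that ships with caret.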
library(caret)
data(GermanCredit)

# Class balance in the full data set
prop.table(table(GermanCredit$Class))
#>  Bad Good
#>  0.3  0.7

# Stratified split on the target variable: 60% of each class goes to training
index <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)

# train
prop.table(table(GermanCredit$Class[index]))
#>  Bad Good
#>  0.3  0.7

# test
prop.table(table(GermanCredit$Class[-index]))
#>  Bad Good
#>  0.3  0.7
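To make the idea behind a stratified split concrete, here is a minimal base-R sketch that uses the names from the question. It is not how caret implements createDataPartition; it only illustrates the principle of sampling the same fraction of rows within each class. The synthetic MyDataset, the id column, the helper name stratified_index, and the 70/30 split are all assumptions made for illustration.

```r
# Hypothetical stand-in for MyDataset: 1000 rows, 60% "Yes" and 40% "No".
set.seed(42)
MyDataset <- data.frame(
  id     = 1:1000,
  Leaver = factor(rep(c("Yes", "No"), times = c(600, 400)))
)

# Illustrative helper (not part of caret): sample a fraction p of the row
# indices within each class separately, so every class contributes the
# same share of its rows to the training set.
stratified_index <- function(y, p) {
  rows_by_class <- split(seq_along(y), y)
  picked <- lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(MyDataset$Leaver, p = 0.7)
train <- MyDataset[train_idx, ]   # rows selected for training
test  <- MyDataset[-train_idx, ]  # all remaining rows

# Class proportions in each subset
prop.table(table(train$Leaver))
prop.table(table(test$Leaver))
```

With this synthetic data, both tables should show 0.4 for No and 0.6 for Yes, because 70% of each class is drawn into the training set. On real data where the class counts do not divide evenly, the proportions will match only approximately.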