How do I split my data set between training and testing sets while keeping the ratio of the target variable in both sets?

I have a data set which I intend to split into a training set and a testing set for a machine learning analysis using R.

Assuming my data set (called MyDataset) is 60% Yes and 40% No on the target variable (called Leaver), how can I ensure that my split will maintain that ratio in both the training set and the testing set?

Solution

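What you want is a stratified split of your dataset. You can do this with the createDataPartition function from the caret package. Just make sure your Leaver variable is a factor: for a factor, the random sampling is done within each class, which is what preserves the class ratio. A numeric variable is instead grouped by percentiles before sampling.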

See a code example below.

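library(caret)
data(GermanCredit)

# Class proportions in the full data set
prop.table(table(GermanCredit$Class))
#  Bad Good 
#  0.3  0.7 

set.seed(123)  # added here so the random split is reproducible
# Stratified sample: 60% of the rows of each class go to the training set
index <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)
train <- GermanCredit[index, ]
test  <- GermanCredit[-index, ]

# train
prop.table(table(train$Class))
#  Bad Good 
#  0.3  0.7 

# test
prop.table(table(test$Class))
#  Bad Good 
#  0.3  0.7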
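
If it helps to see what a stratified split does under the hood, here is a minimal base R sketch applied to a data set shaped like the one in the question. Everything in it is an assumption for illustration: MyDataset is made-up toy data with a 60/40 Leaver split, the Tenure column and the 70/30 train/test ratio are arbitrary, and stratified_index is a hypothetical helper, not a caret function. It samples the same fraction of rows within each class, which is the idea behind the stratified split, but it is not caret's implementation.

```r
# Toy stand-in for MyDataset (made-up data): 60 "Yes" and 40 "No" in Leaver
set.seed(42)
MyDataset <- data.frame(
  Tenure = round(runif(100, 0, 10), 1),
  Leaver = factor(rep(c("Yes", "No"), times = c(60, 40)))
)

# Hypothetical helper: pick a fraction p of the row indices within each class
stratified_index <- function(y, p) {
  rows_by_class <- split(seq_along(y), y)
  picked <- lapply(rows_by_class, function(rows) {
    # sample.int on positions avoids sample()'s special case for length-1 vectors
    rows[sample.int(length(rows), size = round(length(rows) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(MyDataset$Leaver, p = 0.7)
train <- MyDataset[train_idx, ]
test  <- MyDataset[-train_idx, ]

# Both tables should show the same class ratio as the full data set
prop.table(table(train$Leaver))
prop.table(table(test$Leaver))
```

With real data the class counts rarely divide this evenly, so expect the proportions in the two sets to be close to the original ratio rather than identical.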