I'm quite new to R and have a question. I have a dataset (1593 observations) that includes a character variable holding several strings and a factor variable with 2 levels, 0 and 1, one value for each string. To build a classifier, I want to split this dataset into 75% test and 25% training samples, and I also want the proportion of 0s to be the same in both the test and training samples. Is there a way to do this?
Here is the structure of my dataset:
'data.frame': 1593 obs. of 6 variables:
$ match_id: int 0 0 0 0 0 0 0 0 0 0 ...
$ Binary : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
$ key : chr "force it" "space created" "hah" "ez 500" ...
Note: I am following the code from the book "Machine Learning with R" by Brett Lantz and applying it to my dataset. The part I want to reproduce on my dataset is this passage from the book:
To confirm that the subsets are representative of the complete set of SMS data, let's
compare the proportion of spam in the training and test data frames:
> prop.table(table(sms_raw_train$type))
ham spam
0.8647158 0.1352842
> prop.table(table(sms_raw_test$type))
ham spam
0.8683453 0.1316547
Both the training data and test data contain about 13 percent spam. This suggests
that the spam messages were divided evenly between the two datasets.
Thanks for any help.
The createDataPartition() function from the caret package is typically used for this purpose, e.g.
library(caret)
set.seed(300)
trainIndex <- createDataPartition(iris$Species, p = .75,
list = FALSE,
times = 1)
irisTrain <- iris[ trainIndex,]
irisTest <- iris[-trainIndex,]
str(irisTrain)
>'data.frame': 114 obs. of 5 variables:
> $ Sepal.Length: num 5.1 4.9 4.7 5 5.4 4.6 5 4.4 5.4 4.8 ...
> $ Sepal.Width : num 3.5 3 3.2 3.6 3.9 3.4 3.4 2.9 3.7 3.4 ...
> $ Petal.Length: num 1.4 1.4 1.3 1.4 1.7 1.4 1.5 1.4 1.5 1.6 ...
> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.2 0.2 ...
> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(irisTest)
>'data.frame': 36 obs. of 5 variables:
> $ Sepal.Length: num 4.6 4.9 5.1 5.1 4.6 4.8 5.2 5.5 5.5 5.1 ...
> $ Sepal.Width : num 3.1 3.1 3.5 3.8 3.6 3.1 4.1 4.2 3.5 3.8 ...
> $ Petal.Length: num 1.5 1.5 1.4 1.5 1 1.6 1.5 1.4 1.3 1.9 ...
> $ Petal.Width : num 0.2 0.1 0.3 0.3 0.2 0.2 0.1 0.2 0.2 0.4 ...
> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
prop.table(table(irisTrain$Species))
> setosa versicolor virginica
> 0.3333333 0.3333333 0.3333333
prop.table(table(irisTest$Species))
> setosa versicolor virginica
> 0.3333333 0.3333333 0.3333333
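When the outcome is a factor, createDataPartition() samples at random within each level, so the split is pseudorandom (reproducible through set.seed()) and approximately stratified: the class proportions in the train and test cohorts closely match those of the full dataset. It's what I use in my own work.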
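To make the idea of sampling within each class concrete, here is a minimal base R sketch on a data frame shaped like the one in the question. It is only an illustration under stated assumptions, not what caret does internally: the data frame `chat`, its simulated contents, the 85/15 class balance, and the names `rows_by_class`, `train_idx`, `chat_train` and `chat_test` are all made up for the example. Only the column names `match_id`, `Binary` and `key` and the row count come from the question.

```r
# Hypothetical stand-in for the asker's data frame (same size and column names;
# the values and the 85/15 class balance are invented for illustration).
set.seed(300)
chat <- data.frame(
  match_id = rep(0L, 1593),
  Binary   = factor(sample(c("0", "1"), 1593, replace = TRUE, prob = c(0.85, 0.15)),
                    levels = c("0", "1")),
  key      = paste("message", seq_len(1593)),
  stringsAsFactors = FALSE
)

# Group the row numbers by class, then draw 75% at random from each group.
rows_by_class <- split(seq_len(nrow(chat)), chat$Binary)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.75 * length(rows)))]
}), use.names = FALSE)

chat_train <- chat[train_idx, ]
chat_test  <- chat[-train_idx, ]

# The proportions of 0s and 1s should be nearly identical in the two subsets.
prop.table(table(chat_train$Binary))
prop.table(table(chat_test$Binary))
```

With the real data, the simulated `chat` would be replaced by the actual data frame, and 0.75 by whatever fraction should go into the training set. The same split should be obtainable with caret by passing the `Binary` column to createDataPartition() in place of iris$Species.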