Is there a way to divide a dataset having the same proportion of a categorical value in each sample?


I'm quite new to R and have a question. I have a dataset (1593 observations) that includes a character variable holding short strings and a factor variable with 2 levels, 0 and 1, that labels each string. To build a classifier, I want to split this dataset into 75% test and 25% training samples, but I also want the same proportion of 0s in both the test and training samples. Is there a way to do this?

Here is the structure of my dataset:

'data.frame':    1593 obs. of  6 variables:
 $ match_id: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Binary  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
 $ key     : chr  "force it" "space created" "hah" "ez 500" ...

Note: I am following the code from the book "Machine Learning with R" by Brett Lantz and applying it to my dataset. This is the part of the book that I want to reproduce with my data:

To confirm that the subsets are representative of the complete set of SMS data, let's
compare the proportion of spam in the training and test data frames:
> prop.table(table(sms_raw_train$type))
ham        spam
0.8647158 0.1352842
> prop.table(table(sms_raw_test$type))
ham         spam
0.8683453 0.1316547
Both the training data and test data contain about 13 percent spam. This suggests
that the spam messages were divided evenly between the two datasets.

Thanks for any help.


Solution

  • The createDataPartition() function from the caret package is typically used for this purpose, e.g.

    library(caret)
    set.seed(300)
    trainIndex <- createDataPartition(iris$Species, p = .75, 
                                      list = FALSE, 
                                      times = 1)
    irisTrain <- iris[ trainIndex,]
    irisTest  <- iris[-trainIndex,]
    
    str(irisTrain)
    >'data.frame':  114 obs. of  5 variables:
    > $ Sepal.Length: num  5.1 4.9 4.7 5 5.4 4.6 5 4.4 5.4 4.8 ...
    > $ Sepal.Width : num  3.5 3 3.2 3.6 3.9 3.4 3.4 2.9 3.7 3.4 ...
    > $ Petal.Length: num  1.4 1.4 1.3 1.4 1.7 1.4 1.5 1.4 1.5 1.6 ...
    > $ Petal.Width : num  0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.2 0.2 ...
    > $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    
    str(irisTest)
    >'data.frame':  36 obs. of  5 variables:
    > $ Sepal.Length: num  4.6 4.9 5.1 5.1 4.6 4.8 5.2 5.5 5.5 5.1 ...
    > $ Sepal.Width : num  3.1 3.1 3.5 3.8 3.6 3.1 4.1 4.2 3.5 3.8 ...
    > $ Petal.Length: num  1.5 1.5 1.4 1.5 1 1.6 1.5 1.4 1.3 1.9 ...
    > $ Petal.Width : num  0.2 0.1 0.3 0.3 0.2 0.2 0.1 0.2 0.2 0.4 ...
    > $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    
    prop.table(table(irisTrain$Species))
    >    setosa versicolor  virginica 
    > 0.3333333  0.3333333  0.3333333 
    
    prop.table(table(irisTest$Species))
    >   setosa versicolor  virginica 
    > 0.3333333  0.3333333  0.3333333 
    

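    This gives a pseudorandom, approximately stratified split into training and test sets, so each class keeps roughly the same share in both, and it's what I use in my own work. Note that p is the proportion of rows placed in the training set, so p = .75 yields a 75% training / 25% test split.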
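
To make the idea concrete for a two-level factor like the one in the question, here is a minimal base R sketch of a stratified split. It is not caret's implementation, only an illustration of sampling within each class. The data frame `df` is synthetic: its column names mirror the question, but the row contents and the roughly 85/15 class balance are assumptions made up for the example.

```r
# Synthetic stand-in for the asker's data (assumption: ~85% "0", ~15% "1").
set.seed(300)
n <- 1593
df <- data.frame(
  match_id = seq_len(n),
  Binary   = factor(sample(c("0", "1"), n, replace = TRUE, prob = c(0.85, 0.15)),
                    levels = c("0", "1")),
  key      = paste("message", seq_len(n)),
  stringsAsFactors = FALSE
)

# Group row indices by class, then draw 75% of the rows inside each class.
rows_by_class <- split(seq_len(nrow(df)), df$Binary)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  # Index via sample.int() so a class with a single row is handled safely.
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE)

df_train <- df[train_idx, ]
df_test  <- df[-train_idx, ]

# The class proportions should be close to identical in both subsets.
prop.table(table(df_train$Binary))
prop.table(table(df_test$Binary))
```

With caret installed, the equivalent call on such a data frame would be `createDataPartition(df$Binary, p = 0.75, list = FALSE)`, used exactly as in the iris example above.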