r random sampling

R: generate a column of random 1s and 0s with restrictions


I have a data set with 500 observations. I would like to generate 1s and 0s randomly under two scenarios.

Current Dataset

  Id     Age    Category   
  1      23     1
  2      24     1
  3      21     2
  .      .      .
  .      .      .
  .      .      .
500      27     3

Scenario 1

  • The total number of 1s should be 200 and they should be random. The remaining 300 should be 0s.

Scenario 2

  • The total number of 1s should be 200. The remaining 300 should be 0s.
    • 40% of the 1s should be in Category 1. That is, 80 1s should be in Category 1.
    • 40% of the 1s should be in Category 2. That is, 80 1s should be in Category 2.
    • 20% of the 1s should be in Category 3. That is, 40 1s should be in Category 3.

Expected Output

  Id     Age    Category  Indicator  
  1      23     1         1
  2      24     1         0
  3      21     2         1
  .      .      .
  .      .      .
  .      .      .
500      27     3         1

I know that sample(c(0, 1), 500, replace = TRUE) will generate 1s and 0s, but I don't know how to make it generate exactly 200 1s at random. I'm also not sure how to generate 80 1s randomly in Category 1, 80 1s in Category 2 and 40 1s in Category 3.


Solution

  • Here's a full worked example.

    Let's say your data looked like this:

    set.seed(69)
    
    df <- data.frame(id = 1:500, 
                     Age = 20 + sample(10, 500, TRUE),
                     Category = sample(3, 500, TRUE))
    
    head(df)
    #>   id Age Category
    #> 1  1  21        2
    #> 2  2  22        2
    #> 3  3  28        3
    #> 4  4  27        2
    #> 5  5  27        1
    #> 6  6  26        2
    

    Now, you didn't mention how many of each category you had, so let's check how many there are in our sample:

    table(df$Category)
    
    #>   1   2   3 
    #> 153 179 168 
    

    Scenario 1 is straightforward. You create a vector of 500 zeros, then write a one at 200 randomly sampled indexes of that vector:

    df$label <- numeric(nrow(df))
    df$label[sample(nrow(df), 200)] <- 1
    
    head(df)
    #>   id Age Category label
    #> 1  1  21        2     1
    #> 2  2  22        2     1
    #> 3  3  28        3     0
    #> 4  4  27        2     0
    #> 5  5  27        1     0
    #> 6  6  26        2     1
    

    So we have random zeros and ones, and when we count them, we get exactly 300 zeros and 200 ones:

    table(df$label)
    #> 
    #>   0   1 
    #> 300 200
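
An equivalent way to think about Scenario 1 is to build the 200 ones and 300 zeros first and then shuffle them. The following is a minimal sketch of that idea, not part of the original reprex; the names `n_rows`, `n_ones` and `indicator` and the seed are chosen here purely for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n_rows <- 500
n_ones <- 200

# Build 200 ones followed by 300 zeros, then shuffle their order
indicator <- sample(rep(c(1, 0), times = c(n_ones, n_rows - n_ones)))

table(indicator)
```

Calling `sample()` on a vector with no size argument returns a random permutation of it, so the counts are fixed by construction and only the positions are random.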
    

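    Scenario 2 is similar but a bit more involved, because we need to perform the same operation groupwise by category. Here the percentages are treated as the share of rows within each category that get a one (40% for categories 1 and 2, 20% for category 3):

    df$label <- numeric(nrow(df))
    df <- do.call("rbind", lapply(split(df, df$Category), function(d) {
      # Category %/% 3 is 0 for categories 1 and 2 and 1 for category 3,
      # so the fraction of ones is 0.4, 0.4 and 0.2 respectively
      n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
      d$label[sample(nrow(d), n_ones)] <- 1 
      d
      }))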
    
    head(df)
    #>      id Age Category label
    #> 1.5   5  27        1     0
    #> 1.10 10  24        1     0
    #> 1.13 13  23        1     1
    #> 1.19 19  24        1     0
    #> 1.26 26  22        1     1
    #> 1.27 27  24        1     1
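
As the output shows, the split and rbind step leaves the rows grouped by category, with compound row names such as 1.5. If the original order matters, something like the following sketch should restore it. The small data frame `df_demo` is a hypothetical stand-in for the combined result, not the real data.

```r
# Hypothetical stand-in for the data frame returned by do.call("rbind", ...)
df_demo <- data.frame(id = c(5, 10, 2, 1),
                      Category = c(1, 1, 2, 3),
                      row.names = c("1.5", "1.10", "2.2", "3.1"))

df_demo <- df_demo[order(df_demo$id), ]  # back to the original id order
rownames(df_demo) <- NULL                # reset row names to 1..n

df_demo
```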
    

    Now, since 40% or 20% of each category's size is not a whole number here (for example, 0.4 × 153 = 61.2), we cannot get exactly 40% and 20% (though you might with your own data). The counts are rounded to the nearest whole number, which gets as close as possible, as the following demonstrates:

    label_table <- table(df$Category, df$label)
    label_table   
    #>       0   1
    #>   1  92  61
    #>   2 107  72
    #>   3 134  34
    
    apply(label_table, 1, function(x) x[2]/sum(x))
    #>         1         2         3 
    #> 0.3986928 0.4022346 0.2023810
    

    Created on 2020-08-12 by the reprex package (v0.3.0)
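
Note that the Scenario 2 code above makes roughly 40%, 40% and 20% of each category's rows ones, which gives 61 + 72 + 34 = 167 ones in the sample data rather than 200. If the requirement is instead read literally, as exactly 80, 80 and 40 ones in categories 1, 2 and 3, a sketch along the following lines might work. It is not part of the original reprex: the stand-in data, the seed, and the names `ones_per_category` and `Indicator` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in data with the same columns as the question (values are made up)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Number of ones wanted in each category: 40%, 40% and 20% of 200
ones_per_category <- c("1" = 80, "2" = 80, "3" = 40)

df$Indicator <- 0
for (category in names(ones_per_category)) {
  rows <- which(df$Category == as.integer(category))
  k <- ones_per_category[[category]]
  # Fail early if a category has fewer rows than the ones it must hold
  stopifnot(length(rows) >= k)
  # Sample positions within rows, so a one-row category is not expanded to 1:n
  df$Indicator[rows[sample.int(length(rows), k)]] <- 1
}

table(df$Category, df$Indicator)
```

Because the loop writes into the existing data frame by row index, the original row order and row names are left unchanged.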