
R: generate a column with random 1s and 0s with restrictions

I have a data set with 500 observations. I would like to generate a column of 1s and 0s at random, under two scenarios.

Current Dataset

  Id     Age    Category   
  1      23     1
  2      24     1
  3      21     2
  .      .      .
  .      .      .
  .      .      .
500      27     3

Scenario 1

  • The total number of 1s should be 200 and they should be random. The remaining 300 should be 0s.

Scenario 2

  • The total number of 1s should be 200. The remaining 300 should be 0s.
    • 40% of the 1s should be in Category1. That is, 80 1s should be in Category1.
    • 40% of the 1s should be in Category2. That is, 80 1s should be in Category2.
    • 20% of the 1s should be in Category3. That is, 40 1s should be in Category3.

Expected Output

  Id     Age    Category  Indicator  
  1      23     1         1
  2      24     1         0
  3      21     2         1
  .      .      .
  .      .      .
  .      .      .
500      27     3         1

I know that sample(c(0, 1), 500, replace = TRUE) will generate random 0s and 1s, but I don't know how to make it generate exactly 200 1s. I am also not sure how to generate 80 1s randomly in Category1, 80 1s in Category2 and 40 1s in Category3.


  • Here's a full worked example.

    Let's say your data looked like this:

    df <- data.frame(id = 1:500, 
                     Age = 20 + sample(10, 500, TRUE),
                     Category = sample(3, 500, TRUE))
    head(df)
    #>   id Age Category
    #> 1  1  21        2
    #> 2  2  22        2
    #> 3  3  28        3
    #> 4  4  27        2
    #> 5  5  27        1
    #> 6  6  26        2

    Now, you didn't mention how many of each category you had, so let's check how many there are in our sample:

    table(df$Category)
    #>   1   2   3 
    #> 153 179 168 

    Scenario 1 is straightforward. You need to create a vector of 500 zeros, then write a one into a random sample of 200 of the indexes of your new vector:

    df$label <- numeric(nrow(df))
    df$label[sample(nrow(df), 200)] <- 1
    head(df)
    #>   id Age Category label
    #> 1  1  21        2     1
    #> 2  2  22        2     1
    #> 3  3  28        3     0
    #> 4  4  27        2     0
    #> 5  5  27        1     0
    #> 6  6  26        2     1

    So we have random zeros and ones, and when we count them we get exactly 300 zeros and 200 ones:

    table(df$label)
    #>   0   1 
    #> 300 200
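
An equivalent way to think about Scenario 1 is to build the 200 ones and 300 zeros first and then shuffle them. The following is a minimal sketch of that idea, not part of the original reprex; the names `n_rows`, `n_ones` and `indicator` and the seed are assumptions made for illustration.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

n_rows <- 500  # number of observations, as in the question
n_ones <- 200  # required number of ones

# Build exactly 200 ones and 300 zeros, then shuffle their order.
# sample() on a vector of length > 1 returns a random permutation of it.
indicator <- sample(rep(c(1, 0), times = c(n_ones, n_rows - n_ones)))

table(indicator)
```

The result can be assigned directly to a column, for example `df$label <- indicator`, provided the data frame has 500 rows.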

    Scenario 2 is similar but a bit more involved, because we need to perform a similar operation groupwise by category:

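    df$label <- numeric(nrow(df))
    df <- do.call("rbind", lapply(split(df, df$Category), function(d) {
      # Share of rows to label: 0.4 for categories 1 and 2, 0.2 for category 3
      n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
      d$label[sample(nrow(d), n_ones)] <- 1
      d  # return the modified group so rbind can reassemble the data frame
    }))
    head(df)
    #>      id Age Category label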
    #> 1.5   5  27        1     0
    #> 1.10 10  24        1     0
    #> 1.13 13  23        1     1
    #> 1.19 19  24        1     0
    #> 1.26 26  22        1     1
    #> 1.27 27  24        1     1

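    Note that this labels 40% of the rows within categories 1 and 2, and 20% of the rows within category 3. Since the number of rows in each category is not nicely divisible by 10, we cannot get exactly 40% and 20% (though you might with your own data), but we get as close as possible to it, as the following demonstrates: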

    label_table <- table(df$Category, df$label)
    label_table
    #>       0   1
    #>   1  92  61
    #>   2 107  72
    #>   3 134  34
    apply(label_table, 1, function(x) x[2]/sum(x))
    #>         1         2         3 
    #> 0.3986928 0.4022346 0.2023810

    Created on 2020-08-12 by the reprex package (v0.3.0)
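
If the fixed counts from the question are needed instead (80 ones in Category 1, 80 in Category 2 and 40 in Category 3, 200 in total), one possible approach is to sample a fixed number of row indexes within each category. This is a hedged sketch rather than part of the original reprex: the `targets` vector, the seed and the simulated data are assumptions, and the `Indicator` column name is taken from the expected output in the question. It only works when each category has at least as many rows as its target.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Simulated data with the same shape as in the worked example
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Required number of ones per category, as stated in the question
targets <- c("1" = 80, "2" = 80, "3" = 40)

df$Indicator <- 0
for (k in names(targets)) {
  rows <- which(df$Category == as.integer(k))
  # Stop with an error if the category is too small for its target
  stopifnot(length(rows) >= targets[[k]])
  # Sample positions within `rows`, then map back to row numbers of df
  chosen <- rows[sample.int(length(rows), targets[[k]])]
  df$Indicator[chosen] <- 1
}

table(df$Category, df$Indicator)
```

Sampling positions with `sample.int()` and indexing back into `rows` avoids the surprise that `sample(x, n)` treats a single number `x` as the range `1:x`.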