Randomly assign character variables that vary by strata/group in R


I am trying to make a mock employee data set to practice some analyses. I already have a mock data set with fake employee names, work IDs, gender, and ethnicity. I also want to add other variables, such as supervisor status and pay grade. In the actual dataset, however, male employees are more likely to be supervisors than female employees. So rather than telling R to make 30% of all cases supervisors and 70% non-supervisors, I want R to make 20% of female cases and 30% of male cases supervisors.

I've tried using case_when() or group_by() along with the sample() function, but I can't get it to work.

An ideal solution would be able to be scaled further than just dichotomous variables because pay grade and ethnicity have 5 levels. In addition, if I could scale the solution to account for multiple variables (say, gender and ethnicity), that would be the best.

Here's some fake data with 5 male and 5 female cases. For this case, let's say I want 40% of male cases to be supervisors (2/5) and only 20% of female cases to be supervisors (1/5).

library(tidyverse)
test <- tibble(emp_num = 1:10,
               ethnicity = c("White", "White", "Hispanic", "Black", "Asian", "White", "White", "Hispanic", "Black", "Asian"),
               gender = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"))

Here is how the answer should look with the correct proportions (of course, which employee number is supervisor doesn't matter for this case, just as long as the different proportions by male and female emerge).

sample_answer <- tibble(emp_num = 1:10,
               ethnicity = c("White", "White", "Hispanic", "Black", "Asian", "White", "White", "Hispanic", "Black", "Asian"),
               gender = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
               sup_status = c("Supervisor", "Supervisor", "Supervisor", "Non-Super", "Non-Super", "Non-Super", "Non-Super", "Non-Super", "Non-Super", "Non-Super"))

Solution

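  • After troubleshooting the code I wrote before posting, along with the answer posted by @Lukas, I found the issue. case_when() evaluates each right-hand side for the whole data frame, so the size argument of sample() needs to equal the total number of rows, and the replace argument needs to be set to TRUE so that sample() can draw more values than the two labels it is given.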

    # Each sample() call returns nrow(test) draws; case_when() keeps the ones for matching rows
    test <- test %>% mutate(supervisor = case_when(
             gender == "Male" ~ sample(c("Supervisor", "Non-Super"), nrow(test), replace = TRUE, prob = c(.4, .6)),
             gender == "Female" ~ sample(c("Supervisor", "Non-Super"), nrow(test), replace = TRUE, prob = c(.2, .8))))
    

    Because each row is an independent random draw, a sample this small may not reproduce these probabilities exactly, but I ran this code on my sample of 5000 and the observed proportions match the targets to within rounding error.
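
    If the proportions need to be exact rather than approximate, as in the 2/5 and 1/5 example from the question, one option is to fix the number of supervisors within each group and shuffle the labels. The following is a minimal sketch and not part of the original answer; sup_rates is a hypothetical lookup vector holding the 40%/20% targets, and round() decides what happens when a group size does not divide evenly.

    ```r
    library(dplyr)

    set.seed(123)  # only to make the shuffle reproducible

    # Same gender layout as the question's test data
    test <- tibble(
      emp_num = 1:10,
      gender  = rep(c("Male", "Female"), times = 5)
    )

    # Hypothetical lookup: target supervisor share for each gender
    sup_rates <- c(Male = 0.4, Female = 0.2)

    exact <- test %>%
      group_by(gender) %>%
      mutate(sup_status = {
        # Number of supervisors to place in this group
        n_sup <- round(n() * sup_rates[[first(gender)]])
        # Build exactly n_sup "Supervisor" labels, then shuffle them within the group
        sample(rep(c("Supervisor", "Non-Super"), times = c(n_sup, n() - n_sup)))
      }) %>%
      ungroup()

    # Counts per gender and status
    count(exact, gender, sup_status)
    ```

    Which employees end up as supervisors still varies from run to run; only the counts per group are fixed.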

    EDIT: If you want your new variables to vary by multiple groups (say, gender and ethnicity), you can do it as you would with any case_when(). For instance:

    library(tidyverse)
    test %>%
      mutate(sup_status = case_when(
        # %in% keeps the ethnicity test tied to the gender test (& binds more tightly than |)
        gender == "Female" & ethnicity %in% c("Black", "Asian", "Hispanic") ~ sample(c("Supervisor", "Not Supervisor"), nrow(test), replace=TRUE, prob=c(.10, .90)),
        gender == "Female" & ethnicity == "White" ~ sample(c("Supervisor", "Not Supervisor"), nrow(test), replace=TRUE, prob=c(.20, .80)),
        gender == "Male" & ethnicity %in% c("Black", "Asian", "Hispanic") ~ sample(c("Supervisor", "Not Supervisor"), nrow(test), replace=TRUE, prob=c(.15, .85)),
        gender == "Male" & ethnicity == "White" ~ sample(c("Supervisor", "Not Supervisor"), nrow(test), replace=TRUE, prob=c(.25, .75))
      ))
    

    Again, with such a small sample dataset, the output won't match these probabilities exactly, but with larger datasets the observed proportions will approach them.
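
    For a variable with more than two levels, such as the 5-level pay grade mentioned in the question, the same idea can be written with group_by() so that sample() only draws as many values as each group has rows. This is a rough sketch under stated assumptions: the emp data, the grade labels, and the probabilities in grade_probs are all made up for illustration.

    ```r
    library(dplyr)

    set.seed(123)  # only for reproducibility

    # Made-up data: 5000 employees with a randomly assigned gender
    emp <- tibble(
      emp_num = 1:5000,
      gender  = sample(c("Male", "Female"), 5000, replace = TRUE)
    )

    # Hypothetical pay-grade labels
    grades <- paste0("Grade ", 1:5)

    # Hypothetical pay-grade distributions; each vector sums to 1
    grade_probs <- list(
      Male   = c(.10, .20, .30, .25, .15),
      Female = c(.20, .30, .30, .15, .05)
    )

    emp <- emp %>%
      group_by(gender) %>%
      # n() is the group size, so each group is sampled with its own probabilities
      mutate(pay_grade = sample(grades, n(), replace = TRUE,
                                prob = grade_probs[[first(gender)]])) %>%
      ungroup()

    # Observed share of each grade within gender
    emp %>%
      count(gender, pay_grade) %>%
      group_by(gender) %>%
      mutate(share = n / sum(n))
    ```

    To vary the probabilities by two variables, the same pattern could be extended by grouping on both columns and keying the lookup list on a combined name, for example paste(first(gender), first(ethnicity)).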