Tags: r, dataframe, random, tidyverse, sampling

Sample from a grouped dataframe with specified probabilities in R


Below, I first group my data.frame (d) by two categorical variables: gender (2 levels: M/F) and sector (5 levels: Education, Industry, NGO, Private, Public). Then I want to sample from the levels of sector with probabilities c(.2, .3, .3, .1, .1), and from the levels of gender with probabilities c(.4, .6).

I'm using the code below to achieve my goal, but without success. Is there a fix for that?

Could you also comment on whether my code generally does what I describe?

d <- read.csv('https://raw.githubusercontent.com/rnorouzian/d/master/su.csv')

library(tidyverse)

set.seed(1)
(out <- d %>%
  group_by(gender,sector) %>%
  slice_sample(n = 2, weight_by = c(.4, .6, .2, .3, .3, .1, .1))) # `Error:  incorrect number of probabilities`

Solution

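  • slice_sample won't do exactly what you want: its weight_by argument expects one weight per row, not one probability per factor level, which is why you get the error. I recommend using splitstackshape to do the job instead. Install and load it as necessary: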

    # install.packages("splitstackshape")
    library(splitstackshape)
    

    There are shorter, faster ways to specify the proportions table, but I'll do it methodically. We start from the total sample wanted, which in this case we'll make n = 100, and then specify the proportions for the various factor levels.

    total_sample <- 100
    M_percent <- .4
    F_percent <- .6
    Education_percent <- .2
    Industry_percent <- .3
    NGO_percent <- .3
    Private_percent <- .1
    Public_percent <- .1
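
One of the shorter ways mentioned above could look like the sketch below. It assumes, as the rest of this answer does, that the share wanted in each gender-by-sector cell is simply the product of the two marginal proportions. The names gender_p, sector_p, cell_n and sizes are made up for illustration, and round() is there only to guard against floating-point noise in the products.

```r
total_sample <- 100
gender_p <- c(F = .6, M = .4)
sector_p <- c(Education = .2, Industry = .3, NGO = .3, Private = .1, Public = .1)

# outer() multiplies every gender share by every sector share (a 2 x 5 matrix)
cell_n <- round(outer(gender_p, sector_p) * total_sample)

# Flatten to a vector named "<gender> <sector>", e.g. "F Education"
sizes <- setNames(
  as.vector(cell_n),
  as.vector(outer(names(gender_p), names(sector_p), paste))
)
sizes
```

A vector built this way has the same names and values as the one typed out by hand in the stratified call.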
    

    Then we call the stratified function with the data, a vector of the two columns we're grouping on, and a named vector that gives, for each group, the number of rows wanted, which we calculate from the proportions above:

    abc <- 
       stratified(indt = d, 
                  c("gender", "sector"), 
                  c("F Education" = F_percent * Education_percent * total_sample, 
                    "M Education" = M_percent * Education_percent * total_sample,
                    "F Industry" = F_percent * Industry_percent * total_sample, 
                    "M Industry" = M_percent * Industry_percent * total_sample,
                    "F NGO" = F_percent * NGO_percent * total_sample, 
                    "M NGO" = M_percent * NGO_percent * total_sample,
                    "F Private" = F_percent * Private_percent * total_sample, 
                    "M Private" = M_percent * Private_percent * total_sample,
                    "F Public" = F_percent * Public_percent * total_sample, 
                    "M Public" = M_percent * Public_percent * total_sample)
                  )
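
To make clear what a fixed size per cell means, here is a rough base R sketch of the same idea, not the splitstackshape implementation. The data frame toy is a hypothetical stand-in for d with 50 rows in every cell, and the sizes are the numbers the products above work out to (for example .6 * .2 * 100 = 12 for "F Education").

```r
set.seed(1)

# Hypothetical toy data standing in for `d`: 50 rows in every gender x sector cell
toy <- expand.grid(
  id     = 1:50,
  gender = c("F", "M"),
  sector = c("Education", "Industry", "NGO", "Private", "Public"),
  stringsAsFactors = FALSE
)

# Rows wanted per cell, named "<gender> <sector>"
sizes <- c("F Education" = 12, "M Education" = 8,
           "F Industry"  = 18, "M Industry"  = 12,
           "F NGO"       = 18, "M NGO"       = 12,
           "F Private"   = 6,  "M Private"   = 4,
           "F Public"    = 6,  "M Public"    = 4)

# Split the data into cells, then draw the requested number of rows from each
cells  <- split(toy, paste(toy$gender, toy$sector))
picked <- Map(function(cell, n) cell[sample(nrow(cell), n), , drop = FALSE],
              cells[names(sizes)], sizes)
base_sample <- do.call(rbind, picked)

# Cross-tabulate to check the number of rows drawn per cell
table(base_sample$gender, base_sample$sector)
```

Sampling here is without replacement, so every cell of the real data needs at least as many rows as are requested from it.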
    

    We get back randomly selected rows in the quantities we requested:

    head(abc, 20)
                fake.name    sector pretest state gender    pre                    email       phone
     1:            Correa Education    1254    TX      F Medium            Correa@...com xxx-xx-1886
     2:        Manzanares Education    1227    CA      F    Low        Manzanares@...com xxx-xx-1539
     3:          el-Daoud Education    1409    CA      F   High          el-Daoud@...com xxx-xx-9972
     4:            Engman Education    1436    CA      F   High            Engman@...com xxx-xx-9446
     5:           el-Kaba Education    1305    NY      F Medium           el-Kaba@...com xxx-xx-7060
     6:           Herrera Education    1405    NY      F   High           Herrera@...com xxx-xx-9146
     7:           el-Sham Education    1286    TX      F Medium           el-Sham@...com xxx-xx-4046
     8:          Harrison Education    1112    NY      F    Low          Harrison@...com xxx-xx-3118
     9:               Zhu Education    1055    CA      F    Low               Zhu@...com xxx-xx-6223
    10:  Deguzman Gransee Education    1312    TX      F Medium  Deguzman Gransee@...com xxx-xx-5676
    11:           Kearney Education    1303    NY      F Medium           Kearney@...com xxx-xx-5145
    12: Hernandez Mendoza Education    1139    CA      F    Low Hernandez Mendoza@...com xxx-xx-9642
    13:            Barros Education    1416    NY      M   High            Barros@...com xxx-xx-2455
    14:            Torres Education    1370    CA      M   High            Torres@...com xxx-xx-2129
    15:              King Education    1346    CA      M Medium              King@...com xxx-xx-5351
    16:           Cabrera Education    1188    NY      M    Low           Cabrera@...com xxx-xx-6349
    17:               Lee Education    1208    CA      M    Low               Lee@...com xxx-xx-7713
    18:            Vernon Education    1216    TX      M    Low            Vernon@...com xxx-xx-7649
    19:       Ripoll-Bunn Education    1419    TX      M   High       Ripoll-Bunn@...com xxx-xx-8126
    20:             Ashby Education    1295    TX      M Medium             Ashby@...com xxx-xx-8416
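
For comparison, this is a sketch of what slice_sample's weight_by can do: it takes one weight per row, so the per-level probabilities have to be looked up for each row first. The toy data and the names gender_p, sector_p and w are assumptions made for illustration. Note that this is weighted sampling of individual rows, so the cell counts vary from draw to draw and depend on how many rows each cell has in the data. It does not reproduce the fixed cell sizes that stratified returns.

```r
library(dplyr)

set.seed(1)

# Hypothetical toy data standing in for `d`
toy <- data.frame(
  gender = sample(c("F", "M"), 500, replace = TRUE),
  sector = sample(c("Education", "Industry", "NGO", "Private", "Public"),
                  500, replace = TRUE),
  stringsAsFactors = FALSE
)

gender_p <- c(F = .6, M = .4)
sector_p <- c(Education = .2, Industry = .3, NGO = .3, Private = .1, Public = .1)

weighted <- toy %>%
  # one weight per row, looked up from that row's own gender and sector
  mutate(w = unname(gender_p[gender] * sector_p[sector])) %>%
  # ungrouped draw of 100 rows; rows with larger weights are more likely picked
  slice_sample(n = 100, weight_by = w)

# Rows drawn per cell (these counts are random, not fixed)
count(weighted, gender, sector)
```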