python r sampling

How can I generate a random subsample of a population with specific requirements?


Say I have a population of mixed ages and genders (and maybe other attributes), and I want to generate a random subsample (with replacement is ok) with certain attributes, e.g.:

  • Sample size N
  • 50% of the sample should be age<30
  • 20% of the sample should be male

I could first randomly pick N/2 people with age<30 and N/2 with age>=30, but the result would likely not have the correct gender mix. I could sub-select further and ensure that 20% of the age<30 people are male, but that is over-specified: I want the overall distributions to match without specifying anything about the joint distribution (the product) of age and gender.

How do I generate this sample? What if I made it slightly more complicated and specified ranges:

  • Sample size N
  • 50-80% under age 30 (uniform probability in that range)
  • 20-30% male (uniform probability in that range)

I imagine it might be possible to generate such a sample iteratively, alternately pruning it to match each requirement until convergence, but I'm not sure how to do it properly. The dumbest way, of course, would be to just generate random samples and reject those that don't match the requirements.


Solution

  • EDIT:

    Here's a sample that is 70% under 30 and 20% male:

    N <- 100000
    orig_u30 <- 0.7
    orig_male <- 0.2
    set.seed(42)
    # Draw age group and gender independently, using the original shares
    my_sample <- data.frame(age = sample(c("under 30", "30+"), N, replace = TRUE,
                                         prob = c(orig_u30, 1 - orig_u30)),
                            gender = sample(c("M", "F"), N, replace = TRUE,
                                            prob = c(orig_male, 1 - orig_male)))
    addmargins(prop.table(table(my_sample$age, my_sample$gender)))
                     F       M     Sum
      30+      0.24292 0.05935 0.30227
      under 30 0.55675 0.14098 0.69773
      Sum      0.79967 0.20033 1.00000
    

    Suppose we want a subsample of those rows that is instead 40% under 30 and 40% male. We can get there by weighting each row according to how the shares we want compare with the shares we have: each weight below is the ratio of the target odds to the current odds for that attribute.

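    # Odds-ratio weight for "under 30" rows: target odds / current odds
    old_u30 <- mean(my_sample$age == "under 30")
    new_u30 <- 0.4
    weight_u30 <- (new_u30 / old_u30) / ((1 - new_u30) / (1 - old_u30))
    
    # Same calculation for male rows
    old_male <- mean(my_sample$gender == "M")
    new_male <- 0.4
    weight_male <- (new_male / old_male) / ((1 - new_male) / (1 - old_male))
    
    # A row's weight is the product of the weights for the attributes it has
    my_sample$weight <- ifelse(my_sample$age == "under 30", weight_u30, 1) *
      ifelse(my_sample$gender == "M", weight_male, 1)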
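
The weight formula above can be checked with a little arithmetic. Since the question is also tagged python, here is a minimal Python sketch of that check. The function names `odds_ratio_weight` and `weighted_share` are made up for illustration and do not come from any library, and the check assumes the attribute is reweighted on its own (or independently of the other attributes).

```python
def odds_ratio_weight(old_share, new_share):
    """Weight for rows that have the attribute; all other rows keep weight 1.

    Algebraically the same as (new/old) / ((1-new)/(1-old)) in the R code.
    """
    return (new_share / (1 - new_share)) / (old_share / (1 - old_share))


def weighted_share(old_share, weight):
    """Expected share of the attribute after resampling with that weight."""
    return weight * old_share / (weight * old_share + (1 - old_share))


old_u30, new_u30 = 0.7, 0.4
w = odds_ratio_weight(old_u30, new_u30)

print(round(w, 4))                           # weight applied to "under 30" rows
print(round(weighted_share(old_u30, w), 4))  # expected share after reweighting
```

With a 70% starting share and a 40% target, the "under 30" rows get a weight of about 0.29, and the expected share after reweighting comes back to the 40% target.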
    

    Now we have a weighting for each row that will tend to bring it toward the desired shares:

    library(dplyr)
    my_subsample <- sample_n(my_sample, 10000, replace = TRUE, weight = my_sample$weight)
    
    addmargins(prop.table(table(my_subsample$age, my_subsample$gender)))
    

    The subsample is now close to 40% male and 40% under 30:

                    F      M    Sum
      30+      0.3683 0.2348 0.6031
      under 30 0.2375 0.1594 0.3969
      Sum      0.6058 0.3942 1.0000
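
For the range version of the question (50-80% under 30, 20-30% male), one option is to draw each target share uniformly from its range first and then reuse the same weighting. The Python sketch below is only an illustration under stated assumptions: `subsample_in_ranges` and the field names `under_30` and `male` are hypothetical, the population is simulated, and the attributes are assumed to be roughly independent, as in the simulated data above.

```python
import random


def odds_ratio_weight(old_share, new_share):
    """Weight for rows that have the attribute; other rows keep weight 1."""
    return (new_share / (1 - new_share)) / (old_share / (1 - old_share))


def subsample_in_ranges(rows, n, ranges, rng):
    """Draw n rows with replacement; each target share is uniform in its range.

    rows   -- list of dicts with boolean attributes
    ranges -- {attribute: (low, high)} bounds for the target share
    """
    weights = [1.0] * len(rows)
    targets = {}
    for attr, (low, high) in ranges.items():
        targets[attr] = rng.uniform(low, high)
        old_share = sum(row[attr] for row in rows) / len(rows)
        w = odds_ratio_weight(old_share, targets[attr])
        # Multiply in this attribute's weight for the rows that have it
        weights = [wt * (w if row[attr] else 1.0)
                   for wt, row in zip(weights, rows)]
    return rng.choices(rows, weights=weights, k=n), targets


rng = random.Random(42)
# Simulated population: 70% under 30, 20% male, drawn independently
population = [{"under_30": rng.random() < 0.7, "male": rng.random() < 0.2}
              for _ in range(20000)]

sub, targets = subsample_in_ranges(
    population, 5000, {"under_30": (0.5, 0.8), "male": (0.2, 0.3)}, rng)

for attr, target in targets.items():
    share = sum(row[attr] for row in sub) / len(sub)
    print(attr, "target:", round(target, 3), "subsample:", round(share, 3))
```

The realised shares land near the drawn targets but not exactly on them, because the rows are still sampled at random.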
    

    Original answer (this generates a weighted sample, not a weighted subsample):

    N <- 1000
    median_age <- 30
    male <- 0.2
    
    # rpois() takes the mean; for a Poisson with mean 30 the median is also 30
    my_sample <- data.frame(age = rpois(N, median_age),
               gender = sample(c("M", "F"), N, replace = TRUE, prob = c(male, 1 - male)))
    
    median(my_sample$age) # will be 30 on most runs
    table(my_sample$gender) # about 200 of the 1000 will be "M"
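
A note on the weighted subsample in the EDIT: multiplying one weight per attribute hits both targets when age and gender are roughly independent in the population, as they are in the simulated data. If they are correlated, the iterative adjustment the question hints at is usually called raking (iterative proportional fitting). It is not part of the answer above. The Python sketch below is a minimal illustration on a made-up, correlated toy population; `rake` and `weighted_share` are hypothetical helper names.

```python
def weighted_share(rows, weights, attr):
    """Share of total weight carried by rows that have the attribute."""
    return sum(w for w, row in zip(weights, rows) if row[attr]) / sum(weights)


def rake(rows, targets, n_iter=50):
    """Rescale row weights, one attribute at a time, until the margins match.

    Only the marginal shares are constrained; nothing is specified about
    the joint distribution of the attributes.
    """
    weights = [1.0] * len(rows)
    for _ in range(n_iter):
        for attr, target in targets.items():
            share = weighted_share(rows, weights, attr)
            up = target / share                # factor for rows with attr
            down = (1 - target) / (1 - share)  # factor for rows without it
            weights = [w * (up if row[attr] else down)
                       for w, row in zip(weights, rows)]
    return weights


# Made-up correlated population: men are more likely to be under 30
population = (
    [{"under_30": True, "male": True}] * 300
    + [{"under_30": True, "male": False}] * 400
    + [{"under_30": False, "male": True}] * 50
    + [{"under_30": False, "male": False}] * 250
)

targets = {"under_30": 0.4, "male": 0.4}
weights = rake(population, targets)

for attr in targets:
    print(attr, round(weighted_share(population, weights, attr), 4))
```

The resulting weights can then be passed to a weighted sampler, for example `random.choices(population, weights=weights, k=N)` in Python or the `weight` argument of `sample_n` in the R code above.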