Say I have a population of mixed ages and genders (and maybe other attributes), and I want to generate a random subsample (with replacement is ok) with certain attributes, e.g.:
I could first randomly pick N/2 people with age<30 and N/2 with age>=30, but this would likely not have the correct gender mix. I could sub-select and ensure that of the age<30 people, 20% are male, but this is over-specified: I want the overall distributions to match without specifying anything about the cross-tabulation of age and gender.
How do I generate this sample? What if I made it slightly more complicated and specified ranges:
I imagine it might be possible to iteratively generate such a sample, alternately pruning it to match each requirement until convergence, but I'm not sure how to do it properly. The dumbest way of course would be to just generate random samples and reject them if they don't match these requirements.
EDIT:
Here's a sample that is 70% under 30 and 20% male:
N <- 100000
orig_u30 <- 0.7
orig_male <- 0.2
set.seed(42)
my_sample <- data.frame(
  age = sample(c("under 30", "30+"), N, replace = TRUE,
               prob = c(orig_u30, 1 - orig_u30)),
  gender = sample(c("M", "F"), N, replace = TRUE,
                  prob = c(orig_male, 1 - orig_male))
)
addmargins(prop.table(table(my_sample$age, my_sample$gender)))
                 F       M     Sum
  30+      0.24292 0.05935 0.30227
  under 30 0.55675 0.14098 0.69773
  Sum      0.79967 0.20033 1.00000
Suppose we want a subsample that is instead 40% under 30 and 40% male. We can get that by weighting each row according to how the shares we want compare with the shares we have:
# Weight for a group = (target odds) / (current odds) of belonging to that group
old_u30 = mean(my_sample$age == "under 30")
new_u30 = 0.4
weight_u30 = (new_u30 / old_u30) / ((1 - new_u30) / (1 - old_u30))
old_male = mean(my_sample$gender == "M")
new_male = 0.4
weight_male = (new_male / old_male) / ((1 - new_male) / (1 - old_male))
# Row weight is the product of the per-attribute weights (1 for "30+" and "F")
my_sample$weight = ifelse(my_sample$age == "under 30", weight_u30, 1) *
  ifelse(my_sample$gender == "M", weight_male, 1)
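To see why the odds ratio is the right weight: if a share p of rows gets weight w and the rest get weight 1, the weighted share of that group is p*w / (p*w + 1 - p), and choosing w as target odds over current odds makes this equal the target. A quick check of that arithmetic, where `odds_weight` and `weighted_share` are helper names made up for this illustration:

```r
# Hypothetical helpers for illustration only.
# Weight = (target odds) / (current odds), same formula as weight_u30 above.
odds_weight <- function(old, new) {
  (new / (1 - new)) / (old / (1 - old))
}

# Share of the up/down-weighted group after weighting,
# when every other row keeps weight 1.
weighted_share <- function(old, w) {
  old * w / (old * w + (1 - old))
}

w <- odds_weight(old = 0.7, new = 0.4)
w                       # about 0.286
weighted_share(0.7, w)  # 0.4 (up to floating point)
```

This only covers one attribute at a time; multiplying the per-attribute weights reproduces both targets when the attributes are independent in the population, as they are in the simulated data here.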
Now we have a weighting for each row that will tend to bring it toward the desired shares:
library(dplyr)
my_subsample <- sample_n(my_sample, 10000, replace = TRUE, weight = my_sample$weight)
addmargins(prop.table(table(my_subsample$age, my_subsample$gender)))
Now it's approximately 40% male and 40% under 30:
                F      M    Sum
  30+      0.3683 0.2348 0.6031
  under 30 0.2375 0.1594 0.3969
  Sum      0.6058 0.3942 1.0000
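The product of per-attribute weights hits both margins here because age and gender were generated independently. If the attributes are correlated in the real population, one option is to alternate between the margins until the weights settle, which is essentially the iterative idea raised in the question (commonly called raking, or iterative proportional fitting). Below is a minimal base-R sketch of that idea, not a definitive implementation: `rake_weights` is a hypothetical helper, and the correlated population is made up for the example.

```r
# Hypothetical helper: rescale row weights one attribute at a time until
# every weighted margin matches its target share.
rake_weights <- function(df, targets, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(df))
  for (i in seq_len(max_iter)) {
    w_prev <- w
    for (col in names(targets)) {
      values <- as.character(df[[col]])
      # Current weighted share of each level of this attribute
      current <- tapply(w, values, sum) / sum(w)
      # Per-level adjustment: target share / current share
      ratio <- targets[[col]][names(current)] / as.vector(current)
      w <- w * unname(ratio[values])
    }
    # Stop once a full pass over all attributes barely changes the weights
    if (max(abs(w - w_prev)) < tol) break
  }
  w
}

# Made-up population in which gender depends on age (assumption for the demo)
set.seed(1)
n <- 50000
age <- sample(c("under 30", "30+"), n, replace = TRUE, prob = c(0.7, 0.3))
gender <- ifelse(runif(n) < ifelse(age == "under 30", 0.1, 0.4), "M", "F")
pop <- data.frame(age = age, gender = gender)

targets <- list(
  age    = c("under 30" = 0.4, "30+" = 0.6),
  gender = c("M" = 0.4, "F" = 0.6)
)
pop$weight <- rake_weights(pop, targets)

# Weighted margins, which should now match the targets
tapply(pop$weight, pop$age, sum) / sum(pop$weight)
tapply(pop$weight, pop$gender, sum) / sum(pop$weight)

# Draw the weighted subsample with base R
idx <- sample(nrow(pop), 10000, replace = TRUE, prob = pop$weight)
sub <- pop[idx, ]
```

The margins of `sub` will still vary around the targets from draw to draw, since the weights only fix the expected shares. Range targets like those in the question are not handled by this sketch.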
Original answer (this generates a weighted sample from scratch, not a weighted subsample of an existing population):
N <- 1000
median_age <- 30
male <- 0.2
my_sample <- data.frame(
  age = rpois(N, median_age),  # Poisson with mean 30, so the median is close to 30 too
  gender = sample(c("M", "F"), N, replace = TRUE, prob = c(male, 1 - male))
)
median(my_sample$age)   # will be 30 on most runs
table(my_sample$gender) # around 200 M out of 1000