I have a sample of 150 observations. I want to randomly select 24 rows (individuals) based on three conditions. The data come from three different regions, with two possible genders and six possible age groups, so each sample should contain one man and one woman from each region in each age group.
Question 1a: I have code to select based on one condition (for example, the code below picks 2 from each age group), but how can I expand this to cover all the other conditions I have specified above?
Question 1b: Then, how can I save the IDs from each sample?
#create data
set.seed(1)
mydf <- data.frame(ID = rep(1:150), age = rep(1:6), region = rep(1:3), gender = rep(1:2))
rankings <- data.frame(matrix(rnorm(45), ncol=150))
colnames(rankings) <- mydf$ID #rename columns with id because each column in rankings is a person
#Sample conditionally
sample_each <- function(data, var, n = 1L) {
  lvl <- table(data[, var])
  n1 <- setNames(rep_len(n, length(lvl)), names(lvl))
  n0 <- lvl - n1
  idx <- ave(as.character(data[, var]), data[, var], FUN = function(x)
    sample(rep(0:1, c(n0[x[1]], n1[x[1]]))))
  data[!!(as.numeric(idx)), ]
}
#Try sampling
sample_each(mydf, 'age', 2)
In dplyr, you could do this:
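library(dplyr)
# select one row from each age/region/gender group
df2 <- mydf %>% group_by(age, region, gender) %>% sample_n(1) %>% ungroup()
# fill up to 24 rows with random rows that were not already chosen
# (only works if there are at most 24 groups)
sampled <- mydf %>% filter(!ID %in% df2$ID) %>% sample_n(24 - nrow(df2)) %>%
  bind_rows(df2) # add first set back in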
Your example data does not cover all the possible groups because of the way you have constructed it: age, region, and gender are all recycled in step (6 = 2 × 3), so only 6 of the 36 possible combinations ever occur. This approach should still work in the more general case.
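For Question 1b, and to see the sampling on data where every combination does occur, here is a minimal base R sketch. The names `mydf2`, `rankings2`, `draw_one`, and `picked_ids` are made up for illustration, and the balanced data (5 people per cell, 180 rows) is an assumption rather than your real data. Note that 6 age groups × 3 regions × 2 genders gives 36 cells, so one person per cell yields 36 rows, not 24.

```r
set.seed(1)

# Hypothetical balanced data: every age x region x gender cell occurs,
# with 5 people per cell (6 * 3 * 2 * 5 = 180 rows, not the original 150).
cells <- expand.grid(age = 1:6, region = 1:3, gender = 1:2)
mydf2 <- cells[rep(seq_len(nrow(cells)), each = 5), ]
rownames(mydf2) <- NULL
mydf2$ID <- seq_len(nrow(mydf2))

# One column per person, named by ID, as in the question (3 rows is arbitrary).
rankings2 <- data.frame(matrix(rnorm(nrow(mydf2) * 3), ncol = nrow(mydf2)))
colnames(rankings2) <- mydf2$ID

# Draw one ID from a vector of IDs. Indexing via sample.int() avoids the
# sample(x, 1) pitfall where a single number x is treated as the range 1:x.
draw_one <- function(ids) ids[sample.int(length(ids), 1)]

# Combine the three conditions into one grouping factor and sample per cell.
cell <- interaction(mydf2$age, mydf2$region, mydf2$gender, drop = TRUE)
picked_ids <- unname(vapply(split(mydf2$ID, cell), draw_one, integer(1)))

# Save the IDs, the sampled rows, and the matching ranking columns.
picked <- mydf2[mydf2$ID %in% picked_ids, ]
picked_rankings <- rankings2[, as.character(picked_ids), drop = FALSE]
```

To repeat the draw several times and keep the IDs from each sample, one option is to wrap the `picked_ids` line in `replicate()`, which would return one column of IDs per sample.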