
Random sample of rows "with at least x of each group" with several conditions


I have a sample of 150 observations. I want to randomly select 24 rows (individuals) based on three conditions. The data come from three regions, with two possible genders and six possible age groups, so each sample should contain one man and one woman from each region and each age group.

Question 1a: I have code to select based on one condition (for example, the code below picks 2 from each age group), but how can I expand this to all the other conditions I have specified above?

Question 1b: Then, how can I save the IDs from each sample?

#create data
set.seed(1)

mydf <- data.frame(ID = rep(1:150), age = rep(1:6), region = rep(1:3), gender = rep(1:2))
rankings <- data.frame(matrix(rnorm(45), ncol=150))
colnames(rankings) <- mydf$ID               #rename columns with id because each column in rankings is a person


#Sample conditionally
sample_each <- function(data, var, n = 1L) {
  lvl <- table(data[, var])
  n1 <- setNames(rep_len(n, length(lvl)), names(lvl))
  n0 <- lvl - n1
  idx <- ave(as.character(data[, var]), data[, var], FUN = function(x)
    sample(rep(0:1, c(n0[x[1]], n1[x[1]]))))
  data[!!(as.numeric(idx)), ]
}

#Try sampling
sample_each(mydf, 'age', 2)

Solution

  • With dplyr you could first sample one row from every group and then fill up to 24 rows at random:

    library(dplyr)
    
    df2 <- mydf %>%
      group_by(age, region, gender) %>%
      sample_n(1) %>%                        #select one from each group
      ungroup()
    
    sampled <- mydf %>%
      anti_join(df2, by = "ID") %>%          #exclude rows already chosen, so no duplicates
      sample_n(24 - nrow(df2)) %>%           #select rest randomly
      bind_rows(df2)                         #add first set back in
    

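    Your example data does not cover all the possible groups because of the way you have constructed it: age, region and gender are recycled in lockstep, and since 6 = 2*3 only 6 of the 36 possible combinations ever occur. The approach should still work in a more general case, provided there are no more than 24 non-empty groups (with all 36 present, one row per group already gives 36 rows).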
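
A minimal sketch of the more general case, which also covers Question 1b (saving the IDs). It assumes dplyr 1.0.0 or later for `slice_sample()`. The names `demo_df` and `draw_ids()`, and the choice of 5 repeated samples, are hypothetical and only for illustration; `demo_df` is built with `expand.grid()` so that every age/region/gender combination is guaranteed to occur, unlike the question's data.

```r
library(dplyr)

set.seed(1)

# Hypothetical data: all 36 age x region x gender combinations,
# repeated until there are 150 rows (each combination occurs at least 4 times)
grid <- expand.grid(age = 1:6, region = 1:3, gender = 1:2)
demo_df <- grid[rep(seq_len(nrow(grid)), length.out = 150), ]
rownames(demo_df) <- NULL
demo_df$ID <- 1:150

# Check how many distinct combinations the data actually contains
n_groups <- nrow(distinct(demo_df, age, region, gender))

# One random person per age x region x gender combination
one_each <- demo_df %>%
  group_by(age, region, gender) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Helper (hypothetical name): draw one sample and return only the IDs
draw_ids <- function(data) {
  data %>%
    group_by(age, region, gender) %>%
    slice_sample(n = 1) %>%
    pull(ID)
}

# Store the IDs of 5 independent samples, one column per sample
id_samples <- replicate(5, draw_ids(demo_df))
```

Here `one_each` has one row per combination, and each column of `id_samples` holds the IDs of one sample. In the question's setup, where the columns of `rankings` are named by ID, something like `rankings[, as.character(id_samples[, 1])]` should pick the matching columns for the first sample.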