r, dplyr, subset

Subsampling a set number of levels from a factor but including all rows of each level in R


Say I have a data frame df:

    species  diet
 a1  blue    round
 a2  blue    round 
 a3  red     round
 a4  yellow  round
 a5  yellow  round
 a6  yellow  round
 a7  black   square
 a8  white   square
 a9  maroon  square
 a10 orange  square
 a11 orange  square
 a12 orange  square

For each diet I want to randomly sample 2 species but INCLUDE all rows of each sampled species. So if I run the code properly on the above df, I could randomly end up with:

    species  diet
 a1  blue    round
 a2  blue    round 
 a3  red     round
 a7  black   square
 a8  white   square

Here, for the round diet, blue and red were picked with all their rows included, and yellow was dropped. For square, black and white were randomly picked and the others dropped.

I know there is the sample_n function, but it only picks a set number of rows for a given level. I need to pick 2 random levels and keep all the rows associated with those two levels. Any ideas how to do this? Ideally I would want to run a for loop to do this many times without replacement, but I'll take an answer for just one iteration. Thanks!

Cheers, Sam


Solution

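  • Using group_by() and filter(), you could sample two unique species within each diet and keep every row whose species is among them (dat is defined under DATA below):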

    set.seed(123)
    
    library(dplyr, warn.conflicts = FALSE)
    
    dat |>
      tibble::rownames_to_column(var = "id") |>
      group_by(diet) |>
      filter(species %in% sample(unique(species), 2)) |>
      ungroup()
    #> # A tibble: 7 × 3
    #>   id    species diet  
    #>   <chr> <chr>   <chr> 
    #> 1 a1    blue    round 
    #> 2 a2    blue    round 
    #> 3 a4    yellow  round 
    #> 4 a5    yellow  round 
    #> 5 a6    yellow  round 
    #> 6 a8    white   square
    #> 7 a9    maroon  square
    

    DATA

    dat <- structure(list(species = c(
      "blue", "blue", "red", "yellow", "yellow",
      "yellow", "black", "white", "maroon", "orange", "orange", "orange"
    ), diet = c(
      "round", "round", "round", "round", "round", "round",
      "square", "square", "square", "square", "square", "square"
    )), class = "data.frame", row.names = c(
      "a1",
      "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10", "a11",
      "a12"
    ))
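
The question also asks about repeating the draw many times. One possible sketch, not part of the original answer: wrap the pipeline in a function and call it with replicate(). The names sample_species and n_species are made up for illustration, and the min() guard for diets with fewer species than requested is an added assumption. sample() draws without replacement by default, so the species picked within one diet in a single draw are always distinct; separate draws are independent of each other.

```r
library(dplyr, warn.conflicts = FALSE)

# Same data as in the question, rebuilt compactly so the block runs on its own
dat <- data.frame(
  species = c("blue", "blue", "red", "yellow", "yellow", "yellow",
              "black", "white", "maroon", "orange", "orange", "orange"),
  diet = rep(c("round", "square"), each = 6),
  row.names = paste0("a", 1:12)
)

# Hypothetical helper: for each diet, pick `n_species` species at random
# and keep every row belonging to the picked species.
sample_species <- function(data, n_species = 2) {
  data |>
    tibble::rownames_to_column(var = "id") |>
    group_by(diet) |>
    # Assumption: if a diet has fewer species than requested, keep them all
    filter(species %in% sample(unique(species),
                               min(n_species, n_distinct(species)))) |>
    ungroup()
}

set.seed(123)

# 100 independent draws, returned as a list of tibbles
draws <- replicate(100, sample_species(dat), simplify = FALSE)
```

Each element of draws is one subsample, e.g. draws[[1]]. If a single data frame is more convenient, dplyr::bind_rows(draws, .id = "iteration") stacks them with an iteration column.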