Say I have the following data frame df:
species diet
a1 blue round
a2 blue round
a3 red round
a4 yellow round
a5 yellow round
a6 yellow round
a7 black square
a8 white square
a9 maroon square
a10 orange square
a11 orange square
a12 orange square
For each diet, I want to randomly sample 2 species but keep ALL rows of each sampled species. So running the code on the above df could, for example, give:
species diet
a1 blue round
a2 blue round
a3 red round
a7 black square
a8 white square
Here, for the round diet, blue and red were picked with all of their rows included, and yellow was dropped. For square, black and white were randomly picked and the others were dropped.
I know there is the sample_n command, but it only picks a set number of rows for a given level. I need to pick 2 random levels and keep all the rows associated with those two levels. Any ideas how to do this? Ideally I would want to run a for loop to do this many, many times without replacement, but I'll take an answer for just one iteration. Thanks!
Cheers, Sam
Using group_by and filter you could do:
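set.seed(123)
library(dplyr, warn.conflicts = FALSE)
# dat is defined in the DATA section below
dat |>
  # keep the row names (a1, a2, ...) as a regular column
  tibble::rownames_to_column(var = "id") |>
  group_by(diet) |>
  # within each diet, draw 2 species and keep every row belonging to them
  filter(species %in% sample(unique(species), 2)) |>
  ungroup()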
#> # A tibble: 7 × 3
#> id species diet
#> <chr> <chr> <chr>
#> 1 a1 blue round
#> 2 a2 blue round
#> 3 a4 yellow round
#> 4 a5 yellow round
#> 5 a6 yellow round
#> 6 a8 white square
#> 7 a9 maroon square
DATA
dat <- structure(list(species = c(
"blue", "blue", "red", "yellow", "yellow",
"yellow", "black", "white", "maroon", "orange", "orange", "orange"
), diet = c(
"round", "round", "round", "round", "round", "round",
"square", "square", "square", "square", "square", "square"
)), class = "data.frame", row.names = c(
"a1",
"a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10", "a11",
"a12"
))
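The question also asks about repeating the draw many times. A minimal sketch of one way to do that is to wrap the same group_by/filter step in a function and call it in a loop. The helper name `sample_species`, its `n_species` argument, and the choice of 100 iterations are assumptions made for illustration, not part of the original answer. The `min()` guard is also an addition: it avoids the error `sample()` raises when a diet has fewer distinct species than requested.

```r
library(dplyr, warn.conflicts = FALSE)

# Same values as in the question (row names dropped for brevity).
dat <- data.frame(
  species = c("blue", "blue", "red", "yellow", "yellow", "yellow",
              "black", "white", "maroon", "orange", "orange", "orange"),
  diet = rep(c("round", "square"), each = 6)
)

# Hypothetical helper: for each diet, draw up to n_species species
# (without replacement) and keep every row of the drawn species.
sample_species <- function(data, n_species = 2) {
  data |>
    group_by(diet) |>
    filter(
      species %in% sample(unique(species), min(n_species, n_distinct(species)))
    ) |>
    ungroup()
}

set.seed(123)
# 100 independent draws; the count is arbitrary.
draws <- lapply(seq_len(100), function(i) sample_species(dat))
```

Each element of `draws` is one sampled data frame. Species are drawn without replacement within a single iteration, but the iterations are independent of each other, so the same pair of species can come up again in a later draw. If "without replacement" was meant across iterations, the draws would need to be deduplicated or enumerated differently.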