Sample a proportional number of rows from a grouped data frame


In this data frame:

df <- data.frame(
  Story = c(rep("C", 6), rep("X", 9), rep("A", 15), rep("B", 12))
)

I want to randomly sample roughly 33% of the rows in each Story. This has proved harder than I expected. For example, this approach, which uses ceiling() and slice_sample(), does not give the desired result:

df %>%
  group_by(Story) %>%
  mutate(ID = row_number()) %>%
  mutate(sample_size = ceiling(n() * 0.33)) %>% 
  slice_sample(n = unique(sample_size))

The desired result has:

  • 2 "C"s
  • 3 "X"s
  • 5 "A"s
  • 4 "B"s
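
For reference, the target sizes listed above are just ceiling(n * 0.33) for each group. A minimal base R sketch of that arithmetic (the names counts and target are illustrative and not part of the original post):

```r
df <- data.frame(
  Story = c(rep("C", 6), rep("X", 9), rep("A", 15), rep("B", 12))
)

# Rows per Story; table() orders the groups alphabetically: A, B, C, X
counts <- table(df$Story)

# Target sample size per Story: 33% of the group size, rounded up
target <- ceiling(counts * 0.33)
target
```

This gives 5, 4, 2 and 3 for A, B, C and X, which matches the list.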

Solution

  • What about just using prop = 1/3 with slice_sample()? The by argument (available in dplyr 1.1.0 and later) makes it sample within each Story:

    > df %>%
    +     slice_sample(prop = 1 / 3, by = Story)
       Story
    1      C
    2      C
    3      X
    4      X
    5      X
    6      A
    7      A
    8      A
    9      A
    10     A
    11     B
    12     B
    13     B
    14     B
    

    Or, if you prefer to use 0.33 and ceiling():

    > df %>%
    +     filter(row_number() %in% sample(n(), ceiling(n() * 0.33)), .by = Story)
       Story
    1      C
    2      C
    3      X
    4      X
    5      X
    6      A
    7      A
    8      A
    9      A
    10     A
    11     B
    12     B
    13     B
    14     B
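
To check the result, one option is to count the sampled rows per Story. The following is only a sketch: it assumes dplyr 1.1.0 or later (needed for by / .by), and the seed and the names sampled and sampled_counts are arbitrary choices for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for `by` / `.by`

df <- data.frame(
  Story = c(rep("C", 6), rep("X", 9), rep("A", 15), rep("B", 12))
)

set.seed(42)  # arbitrary seed, only to make the draw repeatable

sampled <- df %>%
  mutate(ID = row_number(), .by = Story) %>%  # within-group row id, as in the question
  slice_sample(prop = 1 / 3, by = Story)      # draw a third of each Story

# Number of rows drawn per Story, sorted by Story
sampled_counts <- count(sampled, Story)
sampled_counts
```

Which rows are drawn depends on the seed, but the per-group counts do not: A, B, C and X should come out as 5, 4, 2 and 3, for 14 rows in total.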