Tags: r, random, tidyverse

Take random sample of rows from dataframe with grouping variables


I have a dataframe with the following structure:

library(tidyverse)

dat <- tibble(
  item_type  = rep(1:36, each = 6), 
  condition1 = rep(c("a", "b", "c"), times = 72), 
  condition2 = rep(c("y", "z"), each = 3, times = 36)
) %>% 
  # Build an ID column from the three variables, keeping the originals
  unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

which looks like this:

# A tibble: 216 × 4
   unique item_type condition1 condition2
   <chr>      <int> <chr>      <chr>     
 1 1-a-y          1 a          y         
 2 1-b-y          1 b          y         
 3 1-c-y          1 c          y         
 4 1-a-z          1 a          z         
 5 1-b-z          1 b          z         
 6 1-c-z          1 c          z         
 7 2-a-y          2 a          y         
 8 2-b-y          2 b          y         
 9 2-c-y          2 c          y         
10 2-a-z          2 a          z    

I would like to take a random sample of 36 rows. The sample should contain 6 rows for each of the 6 condition1 by condition2 combinations, and no item_type should appear more than once.

At first glance, slice_sample() seems to give me the subset I want:

set.seed(1)
dat %>% 
  slice_sample(n = 6, by = c("condition1", "condition2")) %>% 
  count(condition1, condition2)
  condition1 condition2     n
  <chr>      <chr>      <int>
1 a          y              6
2 a          z              6
3 b          y              6
4 b          z              6
5 c          y              6
6 c          z              6

But on closer inspection I see that item_type is repeated.

set.seed(1)
dat %>% 
  slice_sample(n = 6, by = c("condition1", "condition2")) %>% 
  count(item_type) %>% 
  arrange(desc(n))
# A tibble: 22 × 2
   item_type     n
       <int> <int>
 1        10     3
 2        34     3
 3         1     2
 4         6     2
 5         7     2
 6        15     2
 7        20     2
 8        21     2
 9        23     2
10        25     2
# … with 12 more rows

In other words, each item_type should be drawn at most once across the whole sample, not just within a group. Is it possible to get slice_sample() to do this?

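EDIT: Adding a second toy data example. Here each item_type occurs with only one level of condition1 (items 1 to 18 with "a", items 19 to 36 with "b").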

dat <- tibble(
  item_type  = rep(1:36, each = 3), 
  condition1 = rep(c("a", "b"), each = 54), 
  condition2 = rep(c("x", "y", "z"), times = 36)
) %>% 
  unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

Which looks like this:

# A tibble: 108 × 4
   unique item_type condition1 condition2
   <chr>      <int> <chr>      <chr>     
 1 1-a-x          1 a          x         
 2 1-a-y          1 a          y         
 3 1-a-z          1 a          z         
 4 2-a-x          2 a          x         
 5 2-a-y          2 a          y         
 6 2-a-z          2 a          z         
 7 3-a-x          3 a          x         
 8 3-a-y          3 a          y         
 9 3-a-z          3 a          z         
10 4-a-x          4 a          x    

Attempt to sample:

inner_join(
  dat, 
  distinct(dat,condition1, condition2) %>% 
    uncount(n()) %>% 
    mutate(item_type = sample(n()))
) 

Which produces a data frame with fewer than the 36 rows I need, and with unequal cell counts:

  condition1 condition2     n
  <chr>      <chr>      <int>
1 a          x              4
2 a          y              4
3 a          z              4
4 b          x              3
5 b          y              4
6 b          z              5

Solution

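  • You could repeat each condition1/condition2 combination (uncount(n()) makes 6 copies of each of the 6 combinations, 36 rows in total), assign a shuffled item_type to every copy, and join the result back to dat: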

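    inner_join(
      dat, 
      distinct(dat, condition1, condition2) %>% 
        uncount(n()) %>% 
        # sample(n()) is a random permutation of 1:36, one item_type per row
        mutate(item_type = sample(n())),
      by = c("item_type", "condition1", "condition2")
    )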
    

    Output:

    # A tibble: 36 × 4
       unique item_type condition1 condition2
       <chr>      <int> <chr>      <chr>     
     1 1-b-z          1 b          z         
     2 2-a-z          2 a          z         
     3 3-c-y          3 c          y         
     4 4-c-z          4 c          z         
     5 5-b-z          5 b          z         
     6 6-a-y          6 a          y         
     7 7-c-y          7 c          y         
     8 8-a-y          8 a          y         
     9 9-a-y          9 a          y         
    10 10-c-z        10 c          z         
    # … with 26 more rows
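
    To confirm that a sample like this meets the requirements in the question, a small check can help. The sketch below rebuilds the first toy dataset, reruns the join, and tests the constraints. check_sample() is a hypothetical helper written for this illustration, not part of dplyr or tidyr, and the default of 6 rows per cell is taken from the question.

    ```r
    library(dplyr)
    library(tidyr)

    # Rebuild the first toy dataset from the question.
    dat <- tibble(
      item_type  = rep(1:36, each = 6),
      condition1 = rep(c("a", "b", "c"), times = 72),
      condition2 = rep(c("y", "z"), each = 3, times = 36)
    ) %>%
      unite(unique, item_type, condition1, condition2, sep = "-", remove = FALSE)

    set.seed(1)
    sampled <- inner_join(
      dat,
      distinct(dat, condition1, condition2) %>%
        uncount(n()) %>%
        mutate(item_type = sample(n())),
      by = c("item_type", "condition1", "condition2")
    )

    # Hypothetical helper: returns a named logical vector, one entry per constraint.
    check_sample <- function(x, per_cell = 6) {
      cells <- count(x, condition1, condition2)
      c(
        no_repeated_items = !any(duplicated(x$item_type)),
        balanced_cells    = all(cells$n == per_cell)
      )
    }

    check_sample(sampled)
    ```

    Both entries should be TRUE for a valid sample. Running the same check on the slice_sample() result from the question should flag the repeated item_type values.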
    

    On the second dataset, item_type is nested within condition1, so each condition1 level has to sample from its own range of item_type values (its min to its max):

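    # Range of item_type values available for each condition combination
    ranges <- dat %>% 
      group_by(condition1, condition2) %>% 
      summarize(imin = min(item_type), imax = max(item_type), .groups = "drop")

    inner_join(
      dat, 
      distinct(dat, condition1, condition2) %>% 
        uncount(n()) %>% 
        inner_join(ranges, by = c("condition1", "condition2")) %>% 
        group_by(condition1) %>% 
        # Shuffle the item_type values that belong to this condition1 level
        mutate(item_type = sample(imin[1]:imax[1], size = n())) %>% 
        ungroup() %>% 
        select(-c(imin:imax)),
      by = c("item_type", "condition1", "condition2")
    )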
    

    Output:

    # A tibble: 36 × 4
       unique item_type condition1 condition2
       <chr>      <int> <chr>      <chr>     
     1 1-a-y          1 a          y         
     2 2-a-z          2 a          z         
     3 3-a-z          3 a          z         
     4 4-a-y          4 a          y         
     5 5-a-z          5 a          z         
     6 6-a-y          6 a          y         
     7 7-a-x          7 a          x         
     8 8-a-z          8 a          z         
     9 9-a-y          9 a          y         
    10 10-a-z        10 a          z         
    # … with 26 more rows