r, dplyr, tidyverse, sampling

slice_sample producing different samples in grouped .data


Why do the following two grouping methods produce different samples? My assumption was that both would give the same samples.

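library(dplyr)

small <- data.frame(
  id = 1:100,
  gender = rep(c('male', 'female'))
)

# Approach 1: persistent grouping with group_by()
set.seed(123)
small |>
  group_by(gender) |>
  slice_sample(n = 10, replace = FALSE)

# Approach 2: per-operation grouping with by =
set.seed(123)
small |>
  slice_sample(n = 10, replace = FALSE, by = gender)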

Solution

  • Basically, when you use by (or .by) the groups are kept in order of first appearance, whereas when you use group_by() the groups are sorted. Since 'male' appears before 'female' in the data, the two approaches process the groups in a different order, and this explains the difference in the results.

    My package timeplyr actually has arguments to control this behaviour.

    Edit: You can also control this behaviour through fgroup_by(order =)

    As to why the actual samples are different within each group, my best guess is that even though the seed is the same, the sampling is done in a different by-group order, so each group receives a different part of the random number stream, and this affects which rows are drawn.

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    small <- data.frame(
      id = 1:100,
      gender = rep(c('male', 'female'))
    )
    
    # group_by(): groups are sorted
    set.seed(123)
    res1 <- small |> 
      group_by(gender) |> 
      slice_sample(n = 10, replace = FALSE)
    
    # by =: groups are kept in order of first appearance
    set.seed(123)
    res2 <- small |> 
      slice_sample(n = 10, replace = FALSE, by = gender)
    
    library(timeplyr)
    #> 
    #> Attaching package: 'timeplyr'
    #> The following object is masked from 'package:dplyr':
    #> 
    #>     desc
    
    # sort_groups = TRUE matches group_by(); sort_groups = FALSE matches by =
    res3 <- small |> 
      fslice_sample(n = 10, replace = FALSE, .by = gender, seed = 123, sort_groups = TRUE)
    res4 <- small |> 
      fslice_sample(n = 10, replace = FALSE, .by = gender, seed = 123, sort_groups = FALSE)
    
    identical(as.data.frame(res1), res3)
    #> [1] TRUE
    identical(as.data.frame(res2), res4)
    #> [1] TRUE
    
    # The same control is available through fgroup_by(order =)
    res5 <- small |> 
      fgroup_by(gender, order = TRUE) |> 
      fslice_sample(n = 10, replace = FALSE, seed = 123)
    res6 <- small |> 
      fgroup_by(gender, order = FALSE) |> 
      fslice_sample(n = 10, replace = FALSE, seed = 123)
    
    identical(as.data.frame(res1), as.data.frame(res5))
    #> [1] TRUE
    identical(as.data.frame(res2), as.data.frame(res6))
    #> [1] TRUE
    

    Created on 2024-08-01 with reprex v2.0.2
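
The mechanism guessed at above (same seed, different group order, different draws) can be illustrated with base R alone. This is only a rough sketch and not how dplyr or timeplyr implement grouped sampling internally; sample_by_group() is a hypothetical helper written for this illustration. It resets the seed once and then samples each group in the order given, so whichever group is visited first consumes the first random numbers.

```r
# Illustration only: a base R stand-in for per-group sampling,
# not dplyr's or timeplyr's actual implementation.
gender <- rep(c("male", "female"), times = 50)
id <- 1:100

# Two ways of ordering the groups
sorted_groups <- sort(unique(gender))   # "female" "male" (sorted, as with group_by())
appearance_groups <- unique(gender)     # "male" "female" (first appearance, as with by =)

# Hypothetical helper: set the seed once, then draw 10 ids per group,
# visiting the groups in the order supplied.
sample_by_group <- function(group_order, seed = 123) {
  set.seed(seed)
  draws <- lapply(group_order, function(g) sample(id[gender == g], size = 10))
  names(draws) <- group_order
  draws
}

res_sorted <- sample_by_group(sorted_groups)
res_appearance <- sample_by_group(appearance_groups)

# The first group visited in each run gets the same random positions,
# just applied to a different group's rows.
first_positions_sorted <- match(res_sorted$female, id[gender == "female"])
first_positions_appearance <- match(res_appearance$male, id[gender == "male"])
identical(first_positions_sorted, first_positions_appearance)  # TRUE
```

Because 'female' is drawn first in one run and second in the other, it receives a different part of the random number stream each time, so res_sorted$female and res_appearance$female will generally not match even though the seed is identical.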