Why do the following grouping methods result in different samples? My assumption was that both groupings would give the same samples.
small <- data.frame(
id = 1:100,
gender = rep(c('male', 'female'))
)
set.seed(123)
small |>
group_by(gender) |>
slice_sample(n = 10, replace = F)
set.seed(123)
small |>
slice_sample(n = 10, replace = F, by = gender)
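Basically, when you use by (the per-call grouping argument, spelled .by in most other dplyr verbs), the groups are ordered by first appearance, and when you use group_by(), the groups are sorted by value. Since 'male' appears before 'female' in the data but 'female' sorts before 'male', the two calls process the groups in a different order, which explains the difference in the results.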
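To see the two orderings side by side, here is a small base R sketch. It only mimics the ordering rules described above (first appearance versus sorted) and does not call dplyr's grouping code; the names first_seen and sorted_keys are made up for illustration.

```r
# Same data as in the question
small <- data.frame(
  id = 1:100,
  gender = rep(c('male', 'female'))
)

# Order of first appearance, analogous to by = gender
first_seen <- unique(small$gender)
first_seen   # 'male' comes first, then 'female'

# Sorted order, analogous to group_by(gender)
sorted_keys <- sort(unique(small$gender))
sorted_keys  # 'female' comes first, then 'male'
```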
My package timeplyr actually has arguments to control this behaviour.
Edit: You can also control this behaviour through fgroup_by(order =)
As to why the actual samples are different within each group, my best guess is that, even though the seed is the same, the sampling is done in a different by-group order, which affects which samples are drawn.
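That guess can be illustrated with a base R sketch. It assumes groups are sampled one after another from a single random number stream, which may not match dplyr's internals exactly; ids and draw_in_order() are hypothetical names used only for this illustration.

```r
# Row ids per group, matching the data in the question
# (odd ids are 'male', even ids are 'female')
ids <- list(male = seq(1, 99, by = 2), female = seq(2, 100, by = 2))

# Hypothetical helper: reset the seed, then sample 10 ids per group,
# visiting the groups in the order given
draw_in_order <- function(group_order, seed = 123) {
  set.seed(seed)
  # Each sample() call consumes the next part of the random stream
  out <- lapply(group_order, function(g) sample(ids[[g]], 10))
  names(out) <- group_order
  out
}

sorted_first <- draw_in_order(c("female", "male"))      # like group_by()
appearance_first <- draw_in_order(c("male", "female"))  # like by =

# The group visited first receives the same positions in both runs,
# regardless of which group it is
same_positions <- identical(
  match(sorted_first$female, ids$female),
  match(appearance_first$male, ids$male)
)
same_positions

# The same group sampled at a different point in the stream
# can be compared directly
identical(sorted_first$female, appearance_first$female)
```

Under this assumption, the same seed yields the same sequence of draws, but which group receives the first ten draws depends on the group order, so the rows selected within each group change.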
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
small <- data.frame(
id = 1:100,
gender = rep(c('male', 'female'))
)
set.seed(123)
res1 <- small |>
group_by(gender) |>
slice_sample(n = 10, replace = F)
set.seed(123)
res2 <- small |>
slice_sample(n = 10, replace = F, by = gender)
library(timeplyr)
#>
#> Attaching package: 'timeplyr'
#> The following object is masked from 'package:dplyr':
#>
#> desc
res3 <- small |>
fslice_sample(n = 10, replace = F, .by = gender, seed = 123, sort_groups = TRUE)
res4 <- small |>
fslice_sample(n = 10, replace = F, .by = gender, seed = 123, sort_groups = FALSE)
identical(as.data.frame(res1), res3)
#> [1] TRUE
identical(as.data.frame(res2), res4)
#> [1] TRUE
res5 <- small |>
fgroup_by(gender, order = TRUE) |>
fslice_sample(n = 10, replace = F, seed = 123)
res6 <- small |>
fgroup_by(gender, order = FALSE) |>
fslice_sample(n = 10, replace = F, seed = 123)
identical(as.data.frame(res1), as.data.frame(res5))
#> [1] TRUE
identical(as.data.frame(res2), as.data.frame(res6))
#> [1] TRUE
Created on 2024-08-01 with reprex v2.0.2