Question: Can you really reliably perform filtering based on a draw inside of the filter, or is it doomed to fail?
Objective: Inside of dplyr::filter()
, remove rows of type A (versicolor) if a random draw fails, and remove rows of type B (virginica) if the draw passes. Should always finish with 5 random rows of setosa data, and 1 random row of either virginica (1/8 chance) or versicolor (7/8 chance).
Attempt:
as_tibble(iris) %>%
group_by(Species) %>%
mutate(draw = case_when(
Species == "setosa" ~ 5,
Species == "versicolor" ~ 1,
Species == "virginica" ~ 1
)) %>%
slice(sample(n(),draw[1])) %>%
filter(
if(round(runif(1),3) <= 1/8){ Species != "versicolor" }
else { Species != "virginica" }
)
Problem: While this typically works by returning only one or the other, I will occasionally get both or neither. Out of curiosity, it appears that I get both 11% of the time, neither 11% of the time, and just one (which is correct) only 78% of the time.
I understand that one solution would be to end the pipe after the slice, then do the draw inside an if()
statement:
data <- as_tibble(iris) %>%
group_by(Species) %>%
mutate(draw = case_when(
Species == "setosa" ~ 5,
Species == "versicolor" ~ 1,
Species == "virginica" ~ 1
)) %>%
slice(sample(n(),draw[1]))
if(round(runif(1),3) <= 1/8){
data %>%
filter(Species != "versicolor")
}
else {
data %>%
filter(Species != "virginica")
}
However, since I realized that it is possible to perform the draw inside the filter, I am interested in learning if it is practical.
Why don't you calculate the probability before?
as_tibble(iris) %>%
mutate(prob = runif(1)) %>%
group_by(Species) %>%
mutate(draw = case_when(
Species == "setosa" ~ 5,
Species == "versicolor" ~ +(prob <= 1/8),
Species == "virginica" ~ +(prob > 1/8)
)) %>%
slice(sample(n(),draw[1]))
It looks cleaner then the if ... else
at the end.
The reason why sometimes it print neither or both is because you don't extract 1 random number, but 3! One for each group.
filter
in your code is being applied tp a groupped data.frame
, so you get a random number per each group.
Try to run this code. You will see it will print 3 numbers.
print_random_number <- function(x, Species){
cat(sprintf("%10s: %s <= 1/8 --> %s\n", Species, x, x <= 1/8))
x
}
set.seed(6)
as_tibble(iris) %>%
group_by(Species) %>%
mutate(draw = case_when(
Species == "setosa" ~ 5,
Species == "versicolor" ~ 1,
Species == "virginica" ~ 1
)) %>%
slice(sample(n(),draw[1])) %>%
filter(
if(print_random_number(round(runif(1),3), Species[1]) <= 1/8){ Species != "versicolor" }
else { Species != "virginica" }
)
#> setosa: 0.916 <= 1/8 --> FALSE
#> versicolor: 0.095 <= 1/8 --> TRUE
#> virginica: 0.295 <= 1/8 --> FALSE
#> # A tibble: 5 × 6
#> # Groups: Species [1]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species draw
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 4.9 3.1 1.5 0.1 setosa 5
#> 2 5.1 3.8 1.9 0.4 setosa 5
#> 3 4.3 3 1.1 0.1 setosa 5
#> 4 5 3.2 1.2 0.2 setosa 5
#> 5 4.7 3.2 1.3 0.2 setosa 5
Therefore, you could correct your code simply by adding ungroup
before filter
:
set.seed(6)
as_tibble(iris) %>%
group_by(Species) %>%
mutate(draw = case_when(
Species == "setosa" ~ 5,
Species == "versicolor" ~ 1,
Species == "virginica" ~ 1
)) %>%
slice(sample(n(),draw[1])) %>%
ungroup() %>%
filter(
# I added a print statement here to see what number was pulled
if(print(round(runif(1),3)) <= 1/8){ Species != "versicolor" }
else { Species != "virginica" }
)
#> [1] 0.916
#> # A tibble: 6 × 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species draw
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 4.9 3.1 1.5 0.1 setosa 5
#> 2 5.1 3.8 1.9 0.4 setosa 5
#> 3 4.3 3 1.1 0.1 setosa 5
#> 4 5 3.2 1.2 0.2 setosa 5
#> 5 4.7 3.2 1.3 0.2 setosa 5
#> 6 6.4 3.2 4.5 1.5 versicolor 1