Issue with dplyr filter in R when attempting to filter based on probability

Question: Can you really reliably perform filtering based on a draw inside of the filter, or is it doomed to fail?

Objective: Inside of dplyr::filter(), remove rows of type A (versicolor) if a random draw fails, and remove rows of type B (virginica) if the draw passes. Should always finish with 5 random rows of setosa data, and 1 random row of either virginica (1/8 chance) or versicolor (7/8 chance).

Attempt:

as_tibble(iris) %>%
   group_by(Species) %>%
   mutate(draw = case_when(
      Species == "setosa" ~ 5,
      Species == "versicolor" ~ 1,
      Species == "virginica" ~ 1
   )) %>%
   slice(sample(n(),draw[1])) %>%
   filter(
      if(round(runif(1),3) <= 1/8){ Species != "versicolor" }
      else { Species != "virginica" }
   )

Problem: While this typically works by returning only one or the other, I will occasionally get both or neither. Out of curiosity, it appears that I get both 11% of the time, neither 11% of the time, and just one (which is correct) only 78% of the time.

I understand that one solution would be to end the pipe after the slice, then do the draw inside an if() statement:

data <- as_tibble(iris) %>%
   group_by(Species) %>%
   mutate(draw = case_when(
      Species == "setosa" ~ 5,
      Species == "versicolor" ~ 1,
      Species == "virginica" ~ 1
   )) %>%
   slice(sample(n(),draw[1]))

if(round(runif(1),3) <= 1/8){
   data %>%
      filter(Species != "versicolor")
}
else {
   data %>%
      filter(Species != "virginica")
}

However, since I realized that it is possible to perform the draw inside the filter, I am interested in learning if it is practical.

Solution

Why don't you calculate the probability before?

as_tibble(iris) %>%
  mutate(prob = runif(1)) %>% 
  group_by(Species) %>%
  mutate(draw = case_when(
    Species == "setosa" ~ 5,
    Species == "versicolor" ~ +(prob <= 1/8),
    Species == "virginica" ~ +(prob > 1/8)
  )) %>%
  slice(sample(n(),draw[1]))

It looks cleaner then the if ... else at the end.

The reason why sometimes it print neither or both is because you don't extract 1 random number, but 3! One for each group.

filter in your code is being applied tp a groupped data.frame, so you get a random number per each group.

Try to run this code. You will see it will print 3 numbers.

The condition for setosa doesn't matter.
The random number for versicolor is 0.095 which will set the condition to TRUE and eliminate versicolor from the filter condition.
The random number for virginica is 0.295 which is greater than 1/8. The condition is FALSE and virginica gets erased from the final result.

print_random_number <- function(x, Species){
  cat(sprintf("%10s: %s <= 1/8 --> %s\n", Species, x, x <= 1/8))
  x
}

set.seed(6)
as_tibble(iris) %>%
  group_by(Species) %>%
  mutate(draw = case_when(
    Species == "setosa" ~ 5,
    Species == "versicolor" ~ 1,
    Species == "virginica" ~ 1
  )) %>%
  slice(sample(n(),draw[1])) %>%
  filter(
    if(print_random_number(round(runif(1),3), Species[1]) <= 1/8){ Species != "versicolor" }
    else { Species != "virginica" }
  )

#>     setosa: 0.916 <= 1/8 --> FALSE
#> versicolor: 0.095 <= 1/8 --> TRUE
#>  virginica: 0.295 <= 1/8 --> FALSE
#> # A tibble: 5 × 6
#> # Groups:   Species [1]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  draw
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
#> 1          4.9         3.1          1.5         0.1 setosa      5
#> 2          5.1         3.8          1.9         0.4 setosa      5
#> 3          4.3         3            1.1         0.1 setosa      5
#> 4          5           3.2          1.2         0.2 setosa      5
#> 5          4.7         3.2          1.3         0.2 setosa      5

Therefore, you could correct your code simply by adding ungroup before filter:

set.seed(6)
as_tibble(iris) %>%
  group_by(Species) %>%
  mutate(draw = case_when(
    Species == "setosa" ~ 5,
    Species == "versicolor" ~ 1,
    Species == "virginica" ~ 1
  )) %>%
  slice(sample(n(),draw[1])) %>%
  ungroup() %>%
  filter(
    # I added a print statement here to see what number was pulled
    if(print(round(runif(1),3)) <= 1/8){ Species != "versicolor" }
    else { Species != "virginica" }
  )

#> [1] 0.916
#> # A tibble: 6 × 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     draw
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>      <dbl>
#> 1          4.9         3.1          1.5         0.1 setosa         5
#> 2          5.1         3.8          1.9         0.4 setosa         5
#> 3          4.3         3            1.1         0.1 setosa         5
#> 4          5           3.2          1.2         0.2 setosa         5
#> 5          4.7         3.2          1.3         0.2 setosa         5
#> 6          6.4         3.2          4.5         1.5 versicolor     1