Search code examples
rdplyr

Issue with dplyr filter in R when attempting to filter based on probability


Question: Can you really reliably perform filtering based on a draw inside of the filter, or is it doomed to fail?

Objective: Inside of dplyr::filter(), remove rows of type A (versicolor) if a random draw fails, and remove rows of type B (virginica) if the draw passes. Should always finish with 5 random rows of setosa data, and 1 random row of either virginica (1/8 chance) or versicolor (7/8 chance).

Attempt:

as_tibble(iris) %>%
   group_by(Species) %>%
   mutate(draw = case_when(
      Species == "setosa" ~ 5,
      Species == "versicolor" ~ 1,
      Species == "virginica" ~ 1
   )) %>%
   slice(sample(n(),draw[1])) %>%
   filter(
      if(round(runif(1),3) <= 1/8){ Species != "versicolor" }
      else { Species != "virginica" }
   )

Problem: While this typically works by returning only one or the other, I will occasionally get both or neither. Out of curiosity, it appears that I get both 11% of the time, neither 11% of the time, and just one (which is correct) only 78% of the time.

I understand that one solution would be to end the pipe after the slice, then do the draw inside an if() statement:

data <- as_tibble(iris) %>%
   group_by(Species) %>%
   mutate(draw = case_when(
      Species == "setosa" ~ 5,
      Species == "versicolor" ~ 1,
      Species == "virginica" ~ 1
   )) %>%
   slice(sample(n(),draw[1]))

if(round(runif(1),3) <= 1/8){
   data %>%
      filter(Species != "versicolor")
}
else {
   data %>%
      filter(Species != "virginica")
}

However, since I realized that it is possible to perform the draw inside the filter, I am interested in learning if it is practical.


Solution

  • Why don't you calculate the probability before?

    as_tibble(iris) %>%
      mutate(prob = runif(1)) %>% 
      group_by(Species) %>%
      mutate(draw = case_when(
        Species == "setosa" ~ 5,
        Species == "versicolor" ~ +(prob <= 1/8),
        Species == "virginica" ~ +(prob > 1/8)
      )) %>%
      slice(sample(n(),draw[1]))
    

    It looks cleaner then the if ... else at the end.


    The reason why sometimes it print neither or both is because you don't extract 1 random number, but 3! One for each group.

    filter in your code is being applied tp a groupped data.frame, so you get a random number per each group.

    Try to run this code. You will see it will print 3 numbers.

    • The condition for setosa doesn't matter.
    • The random number for versicolor is 0.095 which will set the condition to TRUE and eliminate versicolor from the filter condition.
    • The random number for virginica is 0.295 which is greater than 1/8. The condition is FALSE and virginica gets erased from the final result.
    print_random_number <- function(x, Species){
      cat(sprintf("%10s: %s <= 1/8 --> %s\n", Species, x, x <= 1/8))
      x
    }
    
    set.seed(6)
    as_tibble(iris) %>%
      group_by(Species) %>%
      mutate(draw = case_when(
        Species == "setosa" ~ 5,
        Species == "versicolor" ~ 1,
        Species == "virginica" ~ 1
      )) %>%
      slice(sample(n(),draw[1])) %>%
      filter(
        if(print_random_number(round(runif(1),3), Species[1]) <= 1/8){ Species != "versicolor" }
        else { Species != "virginica" }
      )
    
    #>     setosa: 0.916 <= 1/8 --> FALSE
    #> versicolor: 0.095 <= 1/8 --> TRUE
    #>  virginica: 0.295 <= 1/8 --> FALSE
    #> # A tibble: 5 × 6
    #> # Groups:   Species [1]
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  draw
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
    #> 1          4.9         3.1          1.5         0.1 setosa      5
    #> 2          5.1         3.8          1.9         0.4 setosa      5
    #> 3          4.3         3            1.1         0.1 setosa      5
    #> 4          5           3.2          1.2         0.2 setosa      5
    #> 5          4.7         3.2          1.3         0.2 setosa      5
    

    Therefore, you could correct your code simply by adding ungroup before filter:

    set.seed(6)
    as_tibble(iris) %>%
      group_by(Species) %>%
      mutate(draw = case_when(
        Species == "setosa" ~ 5,
        Species == "versicolor" ~ 1,
        Species == "virginica" ~ 1
      )) %>%
      slice(sample(n(),draw[1])) %>%
      ungroup() %>%
      filter(
        # I added a print statement here to see what number was pulled
        if(print(round(runif(1),3)) <= 1/8){ Species != "versicolor" }
        else { Species != "virginica" }
      )
    
    #> [1] 0.916
    #> # A tibble: 6 × 6
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     draw
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>      <dbl>
    #> 1          4.9         3.1          1.5         0.1 setosa         5
    #> 2          5.1         3.8          1.9         0.4 setosa         5
    #> 3          4.3         3            1.1         0.1 setosa         5
    #> 4          5           3.2          1.2         0.2 setosa         5
    #> 5          4.7         3.2          1.3         0.2 setosa         5
    #> 6          6.4         3.2          4.5         1.5 versicolor     1