Multiple conditions in filter


I have the following data:

library(tidyverse)

id <- c(1, 1, 2, 2, 3, 3, 1, 1, 3, 3, 3, 
        3, 4, 4, 4, 4, 4, 4)
status <- c("start", "fail",
            "start", "pass",
            "start", "fail",
            "start", "pass",
            "start", "fail",
            "start", "fail",
            "start", "fail",
            "start", "fail",
            "start", "pass")
df <- data.frame(id, status)

> df
   id status
1   1  start
2   1   fail
3   2  start
4   2   pass
5   3  start
6   3   fail
7   1  start
8   1   pass
9   3  start
10  3   fail
11  3  start
12  3   fail
13  4  start
14  4   fail
15  4  start
16  4   fail
17  4  start
18  4   pass

Let's say this is data for a class where students try to pass an exam. Each trial is recorded as a "start" row followed by a "pass" or "fail" row. Some students pass on the first trial (e.g. student id 2), some need multiple trials (e.g. student id 1 passed on the second trial), and some still fail after 3 trials.

I want to keep only those students who failed all 3 trials (e.g. student id 3).

Here is the code I tried:

> df |> 
+   group_by(id) |> 
+   filter(n() == 6 & status == "fail")
# A tibble: 5 × 2
# Groups:   id [2]
     id status
  <dbl> <chr> 
1     3 fail  
2     3 fail  
3     3 fail  
4     4 fail  
5     4 fail  

This is not what I want. Student 4 passed on the 3rd trial, so student 4 should not be in the list. Any idea how to get the desired output? Ideally the output would look like this:

     id status
  <dbl> <chr> 
1     3 fail  

Solution

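  • In the original attempt, n() == 6 is true for every id that has six rows, while status == "fail" is checked row by row, so student 4's "fail" rows are kept as well. Instead, we can first remove all the "start" rows so that only pass/fail rows remain, then keep only those ids that have >= 3 rows, all of which are "fail". Note that the .by argument requires dplyr 1.1.0 or later.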

    library(dplyr)
    
    df %>%
      filter(status != "start") %>%
      filter(n() >= 3 & all(status == "fail"), .by = id) %>%
      distinct()
    
    #  id status
    #1  3   fail
    

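    An alternative using only one filter statement. Note that it does not check the number of trials, so it returns every id that has at least one "fail" and no "pass":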

    df %>%
      filter(all(status != "pass") & status == "fail", .by = id) %>%
      distinct()
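
To check which ids meet the condition before filtering, it can help to summarise the trials per id first. The following is a minimal sketch, assuming dplyr 1.1.0 or later (for `.by`); the names `attempts`, `n_trials`, `n_fail`, `passed` and `failed_ids` are made up for illustration and are not part of the original answer.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  id = c(1, 1, 2, 2, 3, 3, 1, 1, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4),
  status = c("start", "fail", "start", "pass", "start", "fail",
             "start", "pass", "start", "fail", "start", "fail",
             "start", "fail", "start", "fail", "start", "pass")
)

# One row per id: number of finished trials, number of fails,
# and whether the student ever passed
attempts <- df %>%
  filter(status != "start") %>%
  summarise(
    n_trials = n(),
    n_fail   = sum(status == "fail"),
    passed   = any(status == "pass"),
    .by = id
  )

# Ids with at least 3 finished trials and no pass
failed_ids <- attempts %>%
  filter(n_trials >= 3, !passed) %>%
  pull(id)

print(attempts)
print(failed_ids)
```

With the data from the question, `failed_ids` contains only id 3. Unlike the single-filter alternative, this version keeps the trial-count condition explicit, so a student with a single failed trial and no pass would not be selected.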