I have the following data:
library(tidyverse)
id <- c(1, 1, 2, 2, 3, 3, 1, 1, 3, 3, 3,
        3, 4, 4, 4, 4, 4, 4)
status <- c("start", "fail",
            "start", "pass",
            "start", "fail",
            "start", "pass",
            "start", "fail",
            "start", "fail",
            "start", "fail",
            "start", "fail",
            "start", "pass")
df <- data.frame(id, status)
> df
   id status
1   1  start
2   1   fail
3   2  start
4   2   pass
5   3  start
6   3   fail
7   1  start
8   1   pass
9   3  start
10  3   fail
11  3  start
12  3   fail
13  4  start
14  4   fail
15  4  start
16  4   fail
17  4  start
18  4   pass
Let's say this is data for a class where students try to pass an exam. Each trial is recorded as a "start" row followed by a "pass" or "fail" row. Some students pass on the first trial (e.g. student id = 2), some need multiple trials (e.g. student id = 1 needed two trials to pass), and some students still have not passed after 3 trials.
I want to keep only those students who failed all 3 trials (e.g. student id 3).
I tried the following code:
> df |>
+   group_by(id) |>
+   filter(n() == 6 & status == "fail")
# A tibble: 5 × 2
# Groups:   id [2]
     id status
  <dbl> <chr>
1     3 fail
2     3 fail
3     3 fail
4     4 fail
5     4 fail
This is not what I want. Student 4 passed on the 3rd trial, so student 4 should not be in the result. Any idea how to get the desired output? Something like the following would be ideal:
id status
<dbl> <chr>
1 3 fail
We can remove all the "start" rows to keep only the pass/fail rows, then keep only those ids that have at least 3 rows, all of which are "fail".
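library(dplyr)  # the .by argument requires dplyr >= 1.1.0

df %>%
  filter(status != "start") %>%                           # keep only pass/fail rows
  filter(n() >= 3 & all(status == "fail"), .by = id) %>%  # at least 3 trials, all failed
  distinct()                                              # one row per student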
# id status
#1 3 fail
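To see why student 4 is excluded, it can help to summarise each student before filtering. The following is only a sketch, assuming dplyr >= 1.1.0 (for `.by`); the names `summary_by_id`, `attempts`, `passed` and `failed_three` are made up for illustration and are not part of the original data.

```r
library(dplyr)

# Data from the question
id <- c(1, 1, 2, 2, 3, 3, 1, 1, 3, 3, 3,
        3, 4, 4, 4, 4, 4, 4)
status <- rep(c("start", "fail", "start", "pass", "start", "fail",
                "start", "pass", "start", "fail"),
              length.out = 10)
status <- c(status,
            "start", "fail", "start", "fail",
            "start", "fail", "start", "pass")
df <- data.frame(id, status)

# One row per student: number of finished trials and whether any was a pass
summary_by_id <- df %>%
  summarise(
    attempts = sum(status != "start"),  # every non-"start" row is one trial
    passed   = any(status == "pass"),
    .by = id
  )

# Students with at least 3 trials and no pass
failed_three <- summary_by_id %>%
  filter(attempts >= 3, !passed)

print(summary_by_id)
print(failed_three)
```

Student 4 also has 3 trials, but `passed` is TRUE for that id, so only student 3 remains. Working from the summary also returns one row per student directly, so `distinct()` is not needed.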
An alternative using a single filter statement (note that this keeps every student who never passed, regardless of the number of trials):
df %>%
  filter(all(status != "pass") & status == "fail", .by = id) %>%
  distinct()
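The two approaches agree on the data in the question, but they can differ for a student with fewer than 3 failed trials. Below is a small sketch, assuming dplyr >= 1.1.0; student 5 and the names `df2`, `res_count` and `res_nopass` are hypothetical and only serve to show the difference.

```r
library(dplyr)

# Student 3 from the question plus a hypothetical student 5
# who has a single failed trial
df2 <- data.frame(
  id     = c(3, 3, 3, 3, 3, 3, 5, 5),
  status = c("start", "fail", "start", "fail", "start", "fail",
             "start", "fail")
)

# First approach: at least 3 trials, all of them failed
res_count <- df2 %>%
  filter(status != "start") %>%
  filter(n() >= 3 & all(status == "fail"), .by = id) %>%
  distinct()

# Second approach: no "pass" row for the student, any number of trials
res_nopass <- df2 %>%
  filter(all(status != "pass") & status == "fail", .by = id) %>%
  distinct()

print(res_count$id)   # only student 3
print(res_nopass$id)  # students 3 and 5
```

If the "3 trials" requirement matters, the first approach (or adding a count condition to the second) is the safer choice.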