Search code examples
rdplyr

Filter to select pairs of rows


I am trying to filter pairs (identified by id) based on conditions. I have the following data frame,

id <- c(1,1,2,2,1,1,3,3,4,4)

PorF <- c("start","fail","start","pass","start","pass","start","pass","start","fail")

timest <- c(as.POSIXct("2021-05-08 08:15:07"),
            as.POSIXct("2021-05-08 08:15:45"),
            as.POSIXct("2021-05-28 08:17:09"),
            as.POSIXct("2021-05-28 08:17:25"),
            as.POSIXct("2021-05-28 08:32:07"),
            as.POSIXct("2021-05-28 08:32:12"),
            as.POSIXct("2021-05-28 08:33:14"),
            as.POSIXct("2021-05-28 08:33:45"),
            as.POSIXct("2021-05-28 08:34:12"),
            as.POSIXct("2021-05-28 08:34:56"))


testdf <- data.frame(id, PorF, timest)
testdf

> testdf
   id  PorF              timest
1   1 start 2021-05-08 08:15:07
2   1  fail 2021-05-08 08:15:45
3   2 start 2021-05-28 08:17:09
4   2  pass 2021-05-28 08:17:25
5   1 start 2021-05-28 08:32:07
6   1  pass 2021-05-28 08:32:12
7   3 start 2021-05-28 08:33:14
8   3  pass 2021-05-28 08:33:45
9   4 start 2021-05-28 08:34:12
10  4  fail 2021-05-28 08:34:56

I want to filter those IDs that has a start and a pass. All the pairs with start and fail should be filtered out.

My expected output should look like this,

> filtered_testdf
id  PorF              timest
2 start 2021-05-28 08:17:09
2  pass 2021-05-28 08:17:25
1 start 2021-05-28 08:32:07
1  pass 2021-05-28 08:32:12
3 start 2021-05-28 08:33:14
3  pass 2021-05-28 08:33:45

What I am trying is this which is not giving what I want,

testdf |> 
  group_by(id) |> 
  filter(PorF == "start" & PorF == "pass")

Any idea how to achieve the expected result?


Solution

  • You can do:

    library(tidyverse)
    
    testdf |> 
      mutate(id_helper = cumsum(PorF == "start")) |> 
      filter(any(PorF == "start") & any(PorF == "pass"), .by = id_helper)
    

    Or equivalently:

    testdf |> 
      mutate(id_helper = cumsum(PorF == "start")) |> 
      filter(!any(PorF == "fail"), .by = id_helper)
    
      id  PorF              timest id_helper
    1  2 start 2021-05-28 08:17:09         2
    2  2  pass 2021-05-28 08:17:25         2
    3  1 start 2021-05-28 08:32:07         3
    4  1  pass 2021-05-28 08:32:12         3
    5  3 start 2021-05-28 08:33:14         4
    6  3  pass 2021-05-28 08:33:45         4
    

    Note that I had to create an additional id variable since (see my comment above) there are several pairs of rows with the same id (i.e. id 1 has two pairs).