Search code examples
rdplyr

Use data from multiple years (=rows) to filter another column


I have a df:

df <- data.frame(nr = c(rep("n01", 10),
                    rep("n03", 13),
                    rep("n04", 8),
                    rep("n06", 14),
                    rep("n08", 13),
                    rep("n12", 14)),
             yr = c(2012:2021, 
                    2010:2022,
                    2013:2020,
                    2010:2023,
                    2011:2023,
                    2009:2022),
             md = c(rep("no", 5),
                    rep("yes", 8),
                    rep("no", 13),
                    rep("yes", 1),
                    rep("no", 9),
                    rep("yes", 5),
                    rep("no", 20),
                    rep("yes", 2),
                    rep("no", 9)))

I want to filter my dataset based on the md column for each nr. I want to keep only the nr s that have at least 2x a 'yes' in 2 subsequent years, followed by at least 1 "no". This "no" cannot be followed by a "yes" anymore in further years. But the initial 2x "yes" can be preceded by anything. Meaning that in this dummy df I want to keep only n03, n06 and n12. How can I do that? (preferably with dplyr) Of course my real df is waaay bigger so I need an automated review of the no/yes pattern of all rows belonging to a specific nr.


Solution

  • You can try the following dplyr approach, which uses cumsum to look at the total number of "yes" by nr then scans for any "no" after the last yes (using max). Then filters and removes temp variables

    library(dplyr)
    
    df %>%
      mutate(temp = cumsum(md %in% "yes"),
             temp2 = temp == max(temp) & md %in% "no",
             .by = nr) %>%
      filter(any(temp2) & temp >= 2, .by = nr) %>% 
      select(-starts_with("temp"))
    

    Note you could do all this in one messy filter statement and remove the need to create temp variables:

    df %>%
      filter(any(cumsum(md %in% "yes") == max(cumsum(md %in% "yes")) &
                   md %in% "no") & 
               cumsum(md %in% "yes") >= 2, 
             .by = nr) 
    

    Output:

       nr   yr  md
    1  n03 2011 yes
    2  n03 2012 yes
    3  n03 2013  no
    4  n03 2014  no
    5  n03 2015  no
    6  n03 2016  no
    7  n03 2017  no
    8  n03 2018  no
    9  n03 2019  no
    10 n03 2020  no
    11 n03 2021  no
    12 n03 2022  no
    13 n06 2016 yes
    14 n06 2017 yes
    15 n06 2018 yes
    16 n06 2019 yes
    17 n06 2020  no
    18 n06 2021  no
    19 n06 2022  no
    20 n06 2023  no
    21 n12 2013 yes
    22 n12 2014  no
    23 n12 2015  no
    24 n12 2016  no
    25 n12 2017  no
    26 n12 2018  no
    27 n12 2019  no
    28 n12 2020  no
    29 n12 2021  no
    30 n12 2022  no