I have a df:
df <- data.frame(nr = c(rep("n01", 10),
rep("n03", 13),
rep("n04", 8),
rep("n06", 14),
rep("n08", 13),
rep("n12", 14)),
yr = c(2012:2021,
2010:2022,
2013:2020,
2010:2023,
2011:2023,
2009:2022),
md = c(rep("no", 5),
rep("yes", 8),
rep("no", 13),
rep("yes", 1),
rep("no", 9),
rep("yes", 5),
rep("no", 20),
rep("yes", 2),
rep("no", 9)))
I want to filter my dataset based on the md column for each nr. I want to keep only the nr s that have at least 2x a 'yes' in 2 subsequent years, followed by at least 1 "no". This "no" cannot be followed by a "yes" anymore in further years. But the initial 2x "yes" can be preceded by anything. Meaning that in this dummy df I want to keep only n03, n06 and n12. How can I do that? (preferably with dplyr) Of course my real df is waaay bigger so I need an automated review of the no/yes pattern of all rows belonging to a specific nr.
You can try the following dplyr
approach, which uses cumsum
to look at the total number of "yes" by nr
then scans for any "no" after the last yes (using max
). Then filter
s and removes temp variables
library(dplyr)
df %>%
mutate(temp = cumsum(md %in% "yes"),
temp2 = temp == max(temp) & md %in% "no",
.by = nr) %>%
filter(any(temp2) & temp >= 2, .by = nr) %>%
select(-starts_with("temp"))
Note you could do all this in one messy filter
statement and remove the need to create temp
variables:
df %>%
filter(any(cumsum(md %in% "yes") == max(cumsum(md %in% "yes")) &
md %in% "no") &
cumsum(md %in% "yes") >= 2,
.by = nr)
Output:
nr yr md
1 n03 2011 yes
2 n03 2012 yes
3 n03 2013 no
4 n03 2014 no
5 n03 2015 no
6 n03 2016 no
7 n03 2017 no
8 n03 2018 no
9 n03 2019 no
10 n03 2020 no
11 n03 2021 no
12 n03 2022 no
13 n06 2016 yes
14 n06 2017 yes
15 n06 2018 yes
16 n06 2019 yes
17 n06 2020 no
18 n06 2021 no
19 n06 2022 no
20 n06 2023 no
21 n12 2013 yes
22 n12 2014 no
23 n12 2015 no
24 n12 2016 no
25 n12 2017 no
26 n12 2018 no
27 n12 2019 no
28 n12 2020 no
29 n12 2021 no
30 n12 2022 no