Search code examples
rdplyrsummarize

Using current value inside of a conditional summarise in dplyr


I am quite new to dplyr and can't figure what I am doing wrong. I have the following dataset:

   INSTRUMENT_USED Year UniqueCount
1         QUEST_A 2015           1
2         QUEST_A 2016           1
3         QUEST_A 2017           1
4         QUEST_A 2018           1
5         QUEST_A 2019           1
6         QUEST_A 2020           1
7         QUEST_A 2021           0
8         QUEST_A 2022           0
9         QUEST_A 2023           0
10        QUEST_B 2015           1
11        QUEST_B 2016           1
12        QUEST_B 2017           1
13        QUEST_B 2018           1
14        QUEST_B 2019           0
15        QUEST_B 2020           0
16        QUEST_B 2021           1
17        QUEST_B 2022           0
18        QUEST_B 2023           0

And I would like to create a variable 'dbreak' that indicates a break when the two following conditions are met:

  • UniqueCount was 1 for all previous years
  • UniqueCount is 0 for all years afterwards

So in this example above the "dbreak" variable should be "No" everywhere except for QUEST_A 2021 where it would be "Yes"

I somehow managed to make it work manually but as soon as I try to make summarise conditional on the variable Year it does not seem to work anymore.

When I run the code below I get the expected result

df <- df %>%
  group_by(INSTRUMENT_USED) %>%
  arrange(INSTRUMENT_USED,Year) %>%
  mutate(
    prev = lag(UniqueCount),
    dbreak = ifelse(UniqueCount==0 & prev == 1 & 
                      all(UniqueCount[Year <= 2020] == 1) &
                      all(UniqueCount[Year >= 2021] ==0), "YES", "No"))
   INSTRUMENT_USED  Year UniqueCount  prev dbreak
   <fct>           <dbl>       <dbl> <dbl> <chr> 
 1 QUEST_A         2015           1    NA No    
 2 QUEST_A         2016           1     1 No    
 3 QUEST_A         2017           1     1 No    
 4 QUEST_A         2018           1     1 No    
 5 QUEST_A         2019           1     1 No    
 6 QUEST_A         2020           1     1 No    
 7 QUEST_A         2021           0     1 Yes   
 8 QUEST_A         2022           0     0 No    
 9 QUEST_A         2023           0     0 No    
10 QUEST_B         2015           1    NA No    
11 QUEST_B         2016           1     1 No    
12 QUEST_B         2017           1     1 No    
13 QUEST_B         2018           1     1 No    
14 QUEST_B         2019           0     1 No    
15 QUEST_B         2020           0     0 No    
16 QUEST_B         2021           1     0 No    
17 QUEST_B         2022           0     1 No    
18 QUEST_B         2023           0     0 No    

But when I replace the hardcoded years by the Year variable it does not work anymore, or rather I can not find what is wrong with my condition. If I keep it exactly the same is evaluates all to "No", and if I modify to remove the '=' sign in the second all() statement it identifies two breaks for QUEST_B (2019 and 2022).

df<- df %>%
  group_by(INSTRUMENT_USED) %>%
  arrange(INSTRUMENT_USED,Year) %>%
  mutate(
    prev = lag(UniqueCount),
    dbreak = ifelse(UniqueCount==0 & prev == 1 & 
                      all(UniqueCount[Year <= Year-1] == 1) &
                      all(UniqueCount[Year > Year] ==0), "Yes", "No"))
   INSTRUMENT_USED  Year UniqueCount  prev dbreak
   <fct>           <dbl>       <dbl> <dbl> <chr> 
 1 QUEST_A         2015           1    NA No    
 2 QUEST_A         2016           1     1 No    
 3 QUEST_A         2017           1     1 No    
 4 QUEST_A         2018           1     1 No    
 5 QUEST_A         2019           1     1 No    
 6 QUEST_A         2020           1     1 No    
 7 QUEST_A         2021           0     1 Yes   
 8 QUEST_A         2022           0     0 No    
 9 QUEST_A         2023           0     0 No    
10 QUEST_B         2015           1    NA No    
11 QUEST_B         2016           1     1 No    
12 QUEST_B         2017           1     1 No    
13 QUEST_B         2018           1     1 No    
14 QUEST_B         2019           0     1 Yes    
15 QUEST_B         2020           0     0 No    
16 QUEST_B         2021           1     0 No    
17 QUEST_B         2022           0     1 Yes    
18 QUEST_B         2023           0     0 No    

Any idea?


Solution

  • I would calculate the leading difference as a helper column, and then test if (a) there is exactly one difference of 1, (b) all other differences are 0, and (c) the current row's difference is 1, if so 'yes' else 'no':

    df %>%
      group_by(INSTRUMENT_USED) %>%
      arrange(INSTRUMENT_USED,Year) %>%
      mutate(
        diff = UniqueCount - lead(UniqueCount, default = 0),
        dbreak = ifelse(
          sum(diff == 1) == 1 & sum(diff == 0) == (n() - 1) & diff == 1, 
          "Yes", "No"
        )
      )
    # # A tibble: 18 × 5
    # # Groups:   INSTRUMENT_USED [2]
    #    INSTRUMENT_USED  Year UniqueCount  diff dbreak
    #    <chr>           <int>       <int> <int> <chr> 
    #  1 QUEST_A          2015           1     0 No    
    #  2 QUEST_A          2016           1     0 No    
    #  3 QUEST_A          2017           1     0 No    
    #  4 QUEST_A          2018           1     0 No    
    #  5 QUEST_A          2019           1     0 No    
    #  6 QUEST_A          2020           1     1 Yes   
    #  7 QUEST_A          2021           0     0 No    
    #  8 QUEST_A          2022           0     0 No    
    #  9 QUEST_A          2023           0     0 No    
    # 10 QUEST_B          2015           1     0 No    
    # 11 QUEST_B          2016           1     0 No    
    # 12 QUEST_B          2017           1     0 No    
    # 13 QUEST_B          2018           1     1 No    
    # 14 QUEST_B          2019           0     0 No    
    # 15 QUEST_B          2020           0    -1 No    
    # 16 QUEST_B          2021           1     1 No    
    # 17 QUEST_B          2022           0     0 No    
    # 18 QUEST_B          2023           0     0 No