
Using if_else in dplyr to create a count of rows that resets to 1 within a group


I am trying to create a running count of rows that meet a condition. When the condition is not met, the value should be set to 0 rather than continuing the count, and when the condition is met again the count should restart at 1. I group by id so that counts do not spill over into other cross-sectional units. Here is an example of what the data look like:

# A tibble: 5 × 4
#  ccode  year    id civ_int
#  <dbl> <dbl> <dbl>   <dbl>
#1    90  1967     1       0
#2    90  1968     1       0
#3    90  1969     1       0
#4    90  1970     1       0
#5    90  1971     1       0

The issue I am running into is that, within id, the count does not restart at 1. Instead, it continues from where it left off once civ_int returns to 0. For example, the count may have reached 22 and is then set to 0 while civ_int = 1; when civ_int returns to 0, the count picks up at 23. Here is the code I have been using, for reference:

merged <- merged %>%
  mutate(civ_int = if_else(
    deaths >= 25, 1, 0
  )) %>%
  group_by(id) %>%
  mutate(low_years = as.numeric(row_number()
  )) %>%
  mutate(low_years = cumsum(if_else(
    civ_int == 0, 1, 0
  ))) %>%
  mutate(low_years = if_else(
    civ_int == 1, 0, low_years
  )) %>%
  ungroup()

And here is an example of the output I get with this code:

# A tibble: 20 × 5
#      id  year deaths civ_int low_years
#   <dbl> <dbl>  <dbl>   <dbl>     <dbl>
# 1     1  1983      0       0        17
# 2     1  1984      0       0        18
# 3     1  1985      0       0        19
# 4     1  1986      0       0        20
# 5     1  1987      0       0        21
# 6     1  1988      0       0        22
# 7     1  1989    363       1         0
# 8     1  1990    522       1         0
# 9     1  1991    308       1         0
#10     1  1992    273       1         0
#11     1  1993    132       1         0
#12     1  1994    226       1         0
#13     1  1995     74       1         0
#14     1  1996      2       0        23
#15     1  1997      2       0        24
#16     1  1998      1       0        25
#17     1  1999      0       0        26
#18     1  2000      0       0        27
#19     1  2001      0       0        28
#20     1  2002      2       0        29

low_years should reset to 1 in 1996 and count upwards from there, but this does not happen. Any ideas?
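The behaviour can be reproduced without dplyr. The following is a minimal base R sketch for a single id, using a made-up indicator vector (the values are hypothetical and not taken from merged). It suggests that the cumsum() keeps a single running total for the whole group, and the later if_else only masks the civ_int == 1 rows:

```r
# Hypothetical indicator: three quiet years, two conflict years, three quiet years
civ_int <- c(0, 0, 0, 1, 1, 0, 0, 0)

# Same two steps as in the pipeline above, for one id
low_years <- cumsum(ifelse(civ_int == 0, 1, 0))  # one running total, never restarted
low_years <- ifelse(civ_int == 1, 0, low_years)  # masks conflict years with 0

print(low_years)
```

The count after the conflict years continues from the earlier total instead of starting again at 1.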


Solution

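  • Introducing an additional grouping variable that identifies each consecutive run of civ_int values may work for you. Because the count is then computed per run, it restarts whenever civ_int changes.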

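    library(dplyr)
    
    df %>%
      # Integer flag for years with at least 25 deaths
      mutate(civ_int = if_else(deaths >= 25, 1L, 0L)) %>%
      # grp increments whenever civ_int differs from the previous row
      group_by(id, grp = cumsum(civ_int != lag(civ_int, default = 1L))) %>%
      # Count civ_int == 0 rows within each run; civ_int == 1 runs stay at 0
      mutate(low_years = cumsum(civ_int == 0)) %>%
      ungroup() %>%
      select(-grp)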
    # A tibble: 20 × 5
          id  year deaths civ_int low_years
       <int> <int>  <int>   <int>     <int>
     1     1  1983      0       0         1
     2     1  1984      0       0         2
     3     1  1985      0       0         3
     4     1  1986      0       0         4
     5     1  1987      0       0         5
     6     1  1988      0       0         6
     7     1  1989    363       1         0
     8     1  1990    522       1         0
     9     1  1991    308       1         0
    10     1  1992    273       1         0
    11     1  1993    132       1         0
    12     1  1994    226       1         0
    13     1  1995     74       1         0
    14     1  1996      2       0         1
    15     1  1997      2       0         2
    16     1  1998      1       0         3
    17     1  1999      0       0         4
    18     1  2000      0       0         5
    19     1  2001      0       0         6
    20     1  2002      2       0         7
    

    Data

    df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), year = 1983:2002, deaths = c(0L, 
    0L, 0L, 0L, 0L, 0L, 363L, 522L, 308L, 273L, 132L, 226L, 74L, 
    2L, 2L, 1L, 0L, 0L, 0L, 2L)), class = "data.frame", row.names = c(NA, 
    -20L))
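To see how the extra grouping value is built, here is a step-by-step sketch in base R on a short, made-up vector. The names prev and changed are introduced only for illustration, and ave() stands in for the grouped mutate(); this is an assumption-laden walk-through rather than the code used above.

```r
# Hypothetical indicator for a single id
civ_int <- c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L)

# Base R stand-in for dplyr::lag(civ_int, default = 1)
prev <- c(1L, head(civ_int, -1))

# TRUE wherever the indicator differs from the previous row
changed <- civ_int != prev

# Running number of changes: one identifier per consecutive run
grp <- cumsum(changed)

# Cumulative count of civ_int == 0 rows, restarted within each run
low_years <- ave(as.integer(civ_int == 0), grp, FUN = cumsum)

print(data.frame(civ_int, grp, low_years))
```

With several ids, the run identifier has to be combined with id, which is what group_by(id, grp = ...) does. In dplyr 1.1.0 and later, consecutive_id(civ_int) should give an equivalent run identifier without the explicit lag() comparison.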