Search code examples
rdatedplyrlubridate

r - Generating cumulative sum, total sum, and unique identifiers between start and end dates


What I want to do

I have a dataset of protest events in the United States. Some events are stand-alone events, while others persist day-after-day (a "multi-day event"). My dataset is structured at the daily level, so a three-day multi-day event is spread out over three rows.

I want to accomplish the following:

  1. Create a cumulative sum of the number of days thus far in any given multi-day event. Specifically, I want to count the number of days between the "First day" and "Last day" of any linked event.
  2. Put the total number of days of each multi-event as a variable
  3. "Name" each multi-day event by concatenating the state in which the protest occurred and a sequential identity number starting at 1 in each state and extending upwards.

Data

Here's a reproducible example:

# Library
library(tidyverse) # Brings in dplyr

# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
                   date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15", 
                            "2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
                            "2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
                            "2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))

# Now create some lags and leads
test <- test %>%
  group_by(state) %>%
  mutate(date_lag = lag(date),
         date_lead = lead(date),
         days_last = date - date_lag,
         days_next = date_lead - date,
         link_last = if_else(days_last <= 1, 1, 0),
         link_next = if_else(days_next <= 1, 1, 0),
         sequence = if_else(link_last == 0 & link_next == 1, "First day",
                            if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
                                    if_else(link_last == 1 & link_next == 1, "Ongoing",
                                            if_else(link_last == 1 & link_next == 0, "Last day", 
                                                    if_else(link_last == 1 & is.na(link_next)==TRUE, "Last day", "Not linked"))))))

This generates the following dataframe:

   state      date       date_lag   date_lead  days_last days_next link_last link_next sequence  
   <chr>      <date>     <date>     <date>     <drtn>    <drtn>        <dbl>     <dbl> <chr>     
 1 Washington 2021-01-01 NA         2021-01-03 NA days    2 days          NA         0 NA        
 2 Washington 2021-01-03 2021-01-01 2021-01-04  2 days    1 days           0         1 First day 
 3 Washington 2021-01-04 2021-01-03 2021-01-10  1 days    6 days           1         0 Last day  
 4 Washington 2021-01-10 2021-01-04 2021-01-15  6 days    5 days           0         0 Not linked
 5 Washington 2021-01-15 2021-01-10 2021-01-16  5 days    1 days           0         1 First day 
 6 Washington 2021-01-16 2021-01-15 2021-01-17  1 days    1 days           1         1 Ongoing   
 7 Washington 2021-01-17 2021-01-16 2021-01-18  1 days    1 days           1         1 Ongoing   
 8 Washington 2021-01-18 2021-01-17 2021-01-19  1 days    1 days           1         1 Ongoing   
 9 Washington 2021-01-19 2021-01-18 2021-01-28  1 days    9 days           1         0 Last day  
10 Washington 2021-01-28 2021-01-19 NA          9 days   NA days           0        NA NA        
11 Idaho      2021-01-12 NA         2021-01-13 NA days    1 days          NA         1 NA        
12 Idaho      2021-01-13 2021-01-12 2021-01-14  1 days    1 days           1         1 Ongoing   
13 Idaho      2021-01-14 2021-01-13 2021-02-01  1 days   18 days           1         0 Last day  
14 Idaho      2021-02-01 2021-01-14 2021-02-03 18 days    2 days           0         0 Not linked
15 Idaho      2021-02-03 2021-02-01 2021-02-04  2 days    1 days           0         1 First day 
16 Idaho      2021-02-04 2021-02-03 2021-02-05  1 days    1 days           1         1 Ongoing   
17 Idaho      2021-02-05 2021-02-04 2021-02-08  1 days    3 days           1         0 Last day  
18 Idaho      2021-02-08 2021-02-05 2021-02-10  3 days    2 days           0         0 Not linked
19 Idaho      2021-02-10 2021-02-08 2021-02-14  2 days    4 days           0         0 Not linked
20 Idaho      2021-02-14 2021-02-10 NA          4 days   NA days           0        NA NA    

What I want to create:

   state      date       date_lag   date_lead  days_last days_next link_last link_next sequence   cumulative duration name        
   <chr>      <date>     <date>     <date>     <drtn>    <drtn>        <dbl>     <dbl> <chr>           <dbl>    <dbl> <chr>       
 1 Washington 2021-01-01 NA         2021-01-03 NA days    2 days          NA         0 NA                 NA        0 NA          
 2 Washington 2021-01-03 2021-01-01 2021-01-04  2 days    1 days           0         1 First day           1        2 Washington.1
 3 Washington 2021-01-04 2021-01-03 2021-01-10  1 days    6 days           1         0 Last day            2        2 Washington.1
 4 Washington 2021-01-10 2021-01-04 2021-01-15  6 days    5 days           0         0 Not linked         NA        0 NA          
 5 Washington 2021-01-15 2021-01-10 2021-01-16  5 days    1 days           0         1 First day           1        5 Washington.2
 6 Washington 2021-01-16 2021-01-15 2021-01-17  1 days    1 days           1         1 Ongoing             2        5 Washington.2
 7 Washington 2021-01-17 2021-01-16 2021-01-18  1 days    1 days           1         1 Ongoing             3        5 Washington.2
 8 Washington 2021-01-18 2021-01-17 2021-01-19  1 days    1 days           1         1 Ongoing             4        5 Washington.2
 9 Washington 2021-01-19 2021-01-18 2021-01-28  1 days    9 days           1         0 Last day            5        5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA          9 days   NA days           0        NA NA                 NA       NA NA          
11 Idaho      2021-01-12 NA         2021-01-13 NA days    1 days          NA         1 NA                  1        3 Idaho.1     
12 Idaho      2021-01-13 2021-01-12 2021-01-14  1 days    1 days           1         1 Ongoing             2        3 Idaho.1     
13 Idaho      2021-01-14 2021-01-13 2021-02-01  1 days   18 days           1         0 Last day            3        3 Idaho.1     
14 Idaho      2021-02-01 2021-01-14 2021-02-03 18 days    2 days           0         0 Not linked         NA       NA NA          
15 Idaho      2021-02-03 2021-02-01 2021-02-04  2 days    1 days           0         1 First day           1        3 Idaho.2     
16 Idaho      2021-02-04 2021-02-03 2021-02-05  1 days    1 days           1         1 Ongoing             2        3 Idaho.2     
17 Idaho      2021-02-05 2021-02-04 2021-02-08  1 days    3 days           1         0 Last day            3        3 Idaho.2     
18 Idaho      2021-02-08 2021-02-05 2021-02-10  3 days    2 days           0         0 Not linked         NA       NA NA          
19 Idaho      2021-02-10 2021-02-08 2021-02-14  2 days    4 days           0         0 Not linked         NA       NA NA          
20 Idaho      2021-02-14 2021-02-10 NA          4 days   NA days           0        NA NA                 NA       NA NA  

Side question: Why is test$sequence[11] an NA and not "First day"?


Solution

  • I'm not sure these are the specific numbers you're looking for, but this represents what seems to me a simpler and more idiomatic tidyverse approach:

    test %>%
      group_by(state) %>%
      mutate(days_last = as.numeric(date - lag(date)),
             new_section = 1*(is.na(days_last) | days_last > 1),   # EDIT
             section = cumsum(new_section),
             name = paste(state,section, sep = ".")) %>%
      group_by(name) %>%
      mutate(duration = as.numeric(max(date) - min(date) + 1),
         sequence = case_when(duration == 1 ~ "Unlinked",
                              row_number() == 1 ~ "First Day",
                              row_number() == n() ~ "Last Day",
                              TRUE ~ "Ongoing")) %>%
      ungroup()
    

    Here, I mark any gap of more than one day as a new event, take the cumulative sum, and use that to define the duration of each event.

    # A tibble: 20 x 8
       state      date       days_last new_section section name         duration sequence 
       <chr>      <date>         <dbl>       <dbl>   <dbl> <chr>           <dbl> <chr>    
     1 Washington 2021-01-01        NA           1       1 Washington.1        1 Unlinked 
     2 Washington 2021-01-03         2           1       2 Washington.2        2 First Day
     3 Washington 2021-01-04         1           0       2 Washington.2        2 Last Day 
     4 Washington 2021-01-10         6           1       3 Washington.3        1 Unlinked 
     5 Washington 2021-01-15         5           1       4 Washington.4        5 First Day
     6 Washington 2021-01-16         1           0       4 Washington.4        5 Ongoing  
     7 Washington 2021-01-17         1           0       4 Washington.4        5 Ongoing  
     8 Washington 2021-01-18         1           0       4 Washington.4        5 Ongoing  
     9 Washington 2021-01-19         1           0       4 Washington.4        5 Last Day 
    10 Washington 2021-01-28         9           1       5 Washington.5        1 Unlinked 
    11 Idaho      2021-01-12        NA           1       1 Idaho.1             3 First Day
    12 Idaho      2021-01-13         1           0       1 Idaho.1             3 Ongoing  
    13 Idaho      2021-01-14         1           0       1 Idaho.1             3 Last Day 
    14 Idaho      2021-02-01        18           1       2 Idaho.2             1 Unlinked 
    15 Idaho      2021-02-03         2           1       3 Idaho.3             3 First Day
    16 Idaho      2021-02-04         1           0       3 Idaho.3             3 Ongoing  
    17 Idaho      2021-02-05         1           0       3 Idaho.3             3 Last Day 
    18 Idaho      2021-02-08         3           1       4 Idaho.4             1 Unlinked 
    19 Idaho      2021-02-10         2           1       5 Idaho.5             1 Unlinked 
    20 Idaho      2021-02-14         4           1       6 Idaho.6             1 Unlinked