Search code examples
rcumulative-sumcumsum

Cumulative summing between events in R


I know cumulative summing is fairly easy in R, but I am stuck on this particular use case.

On the below table, I would like a 4th column that is the cumulative sum between any individual's successive acceptances, with the counter reset at every instance of their accepting.

So, for example, user one's data would have row 1 as 0, this is the very first instance of any of their acceptances, all users should have 0 on their very first acceptance. Row 2 would have 1 (one day between successive invitations), row 3 would have 21 (20 from their last + 1 from the initial) but their entry on row 5 would have 53 because the counter was reset on their last acceptance. It should not display a zero on subsequent acceptances, that zero should be implied. The only zero should be their very first.

  user  status_name invitationDate
  <fct> <fct>       <date>        
1 1     Accepted    2021-09-09    
2 1     Declined    2021-09-10    
3 1     Accepted    2021-09-30    
4 4     Accepted    2021-11-10    
5 1     Accepted    2021-11-22    
6 4     Declined    2021-11-29  

I have included the code to recreate the table below.

df <- tribble(
  ~user, ~status_name, ~invitationDate,
  "1", "Declined", "2021-07-13",
  "4", "Declined", "2021-07-31",
  "1", "Accepted", "2021-09-09",
  "1", "Declined", "2021-09-10",
  "1", "Accepted", "2021-09-30",
  "4", "Accepted", "2021-11-10",
  "3", "Declined", "2021-11-12",
  "2", "Declined", "2021-11-18",
  "1", "Accepted", "2021-11-22",
  "4", "Declined", "2021-11-29"
) %>%
  mutate(
    user = as.factor(user),
    status_name = as.factor(status_name),
    invitationDate = as.Date(invitationDate, format = "%Y-%m-%d")
  ) %>%
  group_by(user) %>%
  mutate(cumsum = cumsum(status_name == "Accepted")) %>%
  filter(cumsum > 0) %>%
  select(-cumsum)

Solution

  • Note that the input has already been grouped by user so use cumsum to mark the first Accepted and following Declined rows as cs=1, the next Accepted and following Declined rows as cs=2 and so on. Then match each cs to the first row encountered having a cs value of one minus the current cs value giving ix and subtract the date at row ix from the date on the current row. If you prefer to leave df as grouped omit the ungroup line.

    df %>%
      mutate(cs = cumsum(status_name == "Accepted"),
        ix = match(cs - 1, cs, nomatch = 1),
        dif = as.numeric(invitationDate - invitationDate[ix])) %>%
      ungroup %>%
      select(-cs, -ix)
    

    giving

    # A tibble: 6 × 4
      user  status_name invitationDate   dif
      <fct> <fct>       <date>         <dbl>
    1 1     Accepted    2021-09-09         0
    2 1     Declined    2021-09-10         1
    3 1     Accepted    2021-09-30        21
    4 4     Accepted    2021-11-10         0
    5 1     Accepted    2021-11-22        53
    6 4     Declined    2021-11-29        19
    

    A variation of the above is to combine the cs and ix calculation into a priorTRUE function which accepts a logical vector and for each component returns the index of the prior TRUE value. This avoids introducing temporary columns cs and ix and the corresponding select. Similarly, omit the ungroup line if it is desired to retain the grouping.

    priorTRUE <- function(x) {
      cs <- cumsum(x)
      match(cs - 1, cs, nomatch = 1)
    }
    
    df %>%
      mutate(dif = as.numeric(
        invitationDate - invitationDate[priorTRUE(status_name == "Accepted")])) %>%
      ungroup