I know cumulative summing is fairly easy in R, but I am stuck on this particular use case.
On the below table, I would like a 4th column that is the cumulative sum between any individual's successive acceptances, with the counter reset at every instance of their accepting.
So, for example, user one's data would have row 1 as 0, this is the very first instance of any of their acceptances, all users should have 0 on their very first acceptance. Row 2 would have 1 (one day between successive invitations), row 3 would have 21 (20 from their last + 1 from the initial) but their entry on row 5 would have 53 because the counter was reset on their last acceptance. It should not display a zero on subsequent acceptances, that zero should be implied. The only zero should be their very first.
user status_name invitationDate
<fct> <fct> <date>
1 1 Accepted 2021-09-09
2 1 Declined 2021-09-10
3 1 Accepted 2021-09-30
4 4 Accepted 2021-11-10
5 1 Accepted 2021-11-22
6 4 Declined 2021-11-29
I have included the code to recreate the table below.
df <- tribble(
~user, ~status_name, ~invitationDate,
"1", "Declined", "2021-07-13",
"4", "Declined", "2021-07-31",
"1", "Accepted", "2021-09-09",
"1", "Declined", "2021-09-10",
"1", "Accepted", "2021-09-30",
"4", "Accepted", "2021-11-10",
"3", "Declined", "2021-11-12",
"2", "Declined", "2021-11-18",
"1", "Accepted", "2021-11-22",
"4", "Declined", "2021-11-29"
) %>%
mutate(
user = as.factor(user),
status_name = as.factor(status_name),
invitationDate = as.Date(invitationDate, format = "%Y-%m-%d")
) %>%
group_by(user) %>%
mutate(cumsum = cumsum(status_name == "Accepted")) %>%
filter(cumsum > 0) %>%
select(-cumsum)
Note that the input has already been grouped by user
so use cumsum
to mark the first Accepted
and following Declined
rows as cs=1
, the next Accepted
and following Declined
rows as cs=2
and so on. Then match each cs
to the first row encountered having a cs
value of one minus the current cs
value giving ix
and subtract the date at row ix
from the date on the current row. If you prefer to leave df
as grouped omit the ungroup
line.
df %>%
mutate(cs = cumsum(status_name == "Accepted"),
ix = match(cs - 1, cs, nomatch = 1),
dif = as.numeric(invitationDate - invitationDate[ix])) %>%
ungroup %>%
select(-cs, -ix)
giving
# A tibble: 6 × 4
user status_name invitationDate dif
<fct> <fct> <date> <dbl>
1 1 Accepted 2021-09-09 0
2 1 Declined 2021-09-10 1
3 1 Accepted 2021-09-30 21
4 4 Accepted 2021-11-10 0
5 1 Accepted 2021-11-22 53
6 4 Declined 2021-11-29 19
A variation of the above is to combine the cs
and ix
calculation into a priorTRUE
function which accepts a logical vector and for each component returns the index of the prior TRUE value. This avoids introducing temporary columns cs
and ix
and the corresponding select
. Similarly, omit the ungroup
line if it is desired to retain the grouping.
priorTRUE <- function(x) {
cs <- cumsum(x)
match(cs - 1, cs, nomatch = 1)
}
df %>%
mutate(dif = as.numeric(
invitationDate - invitationDate[priorTRUE(status_name == "Accepted")])) %>%
ungroup