Tags: r, data.table, cumulative-sum

rolling sum with conditions


I have a dataset like:

  ID    DATETIME    CODE  Value
999 1/2/2024 16:22  TX     100
123 1/2/2024 16:47  IP     100
666 1/2/2024 17:13  IP      85
666 1/2/2024 17:38  IP     100
123 1/2/2024 18:03  TX      90
666 1/2/2024 18:28  TX      85
666 1/2/2024 18:54  IP     100
123 1/2/2024 19:19  CA     100
666 1/2/2024 19:44  OX      95
999 1/2/2024 20:09  18      75
123 1/2/2024 20:35  12     100
654 1/2/2024 21:00  IP      85

Here's the reprex of the above (assigned to quux, which the solution below references):

quux <- structure(list(ID = c("999", "123", "666", "666", "123", "666", 
"666", "123", "666", "999", "123", "654"), DATETIME = structure(c(1706804520, 
1706806020, 1706807580, 1706809080, 1706810580, 1706812080, 1706813640, 
1706815140, 1706816640, 1706818140, 1706819700, 1706821200), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), CODE = c("TX", "IP", "IP", "IP", "TX", 
"TX", "IP", "CA", "OX", "18", "12", "IP"), Value = c(100, 100, 
85, 100, 90, 85, 100, 100, 95, 75, 100, 85)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -12L))

I'd like to add a column with the cumulative sum of Value, by ID, over the last 2 hours, restricted to rows meeting a condition (CODE == "IP"), like this:

ID  DATETIME       CODE Value   cum_IP
999 1/2/2024 16:22  TX  100      0
123 1/2/2024 16:47  IP  100    100
666 1/2/2024 17:13  IP   85     85
666 1/2/2024 17:38  IP  100    185
123 1/2/2024 18:03  TX   90      0
666 1/2/2024 18:28  TX   85      0
666 1/2/2024 18:54  IP  100    285
123 1/2/2024 19:19  CA  100      0
666 1/2/2024 19:44  OX   95      0
999 1/2/2024 20:09  18   75      0
123 1/2/2024 20:35  12  100      0
654 1/2/2024 21:00  IP   85     85

I hope I haven't made any errors computing the column by hand, but the idea should be clear: e.g. the 285 for ID 666 at 18:54 is 85 + 100 + 100, the three IP rows for that ID within the preceding two hours. What I'm after is a function that computes an aggregated operation (a sum or a simple count) over rows, given a grouping variable (ID), a predicate (CODE == "IP" in this case, but it could just as well be e.g. a check that CODE is numeric) and a window (2 hours back from the current row).
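
Roughly, the interface I have in mind is shaped like the sketch below (rolling_agg is just a made-up name to illustrate the idea, not an existing function):

library(dplyr)

# Illustrative sketch only: `rolling_agg` is a hypothetical helper, not an existing API.
# value/time/pred are parallel vectors for one group; `window` is in seconds.
rolling_agg <- function(value, time, pred, window = 7200, agg = sum) {
  sapply(seq_along(time), function(i) {
    if (!pred[i]) return(0)                        # rows failing the predicate get 0
    in_window <- time >= time[i] - window & time <= time[i]
    agg(value[in_window & pred])                   # aggregate matching rows in the window
  })
}

# intended usage, applied per ID group:
# quux %>% mutate(cum_IP = rolling_agg(Value, DATETIME, CODE == "IP"), .by = ID)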


Solution

  • Try the below. I'm choosing to group by CODE as well to reduce the conditioning inside the code. We use sapply(.) as a simple rolling window over the 2-hour span. The single if conditional is there so that we don't sapply(.) through the data unnecessarily.

    dplyr

    library(dplyr)
    quux %>%
      mutate(
        cum_IP = if (first(CODE) == "IP") {
            sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)]))
          } else 0,
        .by = c(ID, CODE))
    # # A tibble: 12 × 5
    #    ID    DATETIME                CODE  Value cum_IP
    #    <chr> <dttm>                  <chr> <dbl>  <dbl>
    #  1 999   2024-02-01 16:22:00.000 TX      100      0
    #  2 123   2024-02-01 16:47:00.000 IP      100    100
    #  3 666   2024-02-01 17:13:00.000 IP       85     85
    #  4 666   2024-02-01 17:38:00.000 IP      100    185
    #  5 123   2024-02-01 18:03:00.000 TX       90      0
    #  6 666   2024-02-01 18:28:00.000 TX       85      0
    #  7 666   2024-02-01 18:54:00.000 IP      100    285
    #  8 123   2024-02-01 19:19:00.000 CA      100      0
    #  9 666   2024-02-01 19:44:00.000 OX       95      0
    # 10 999   2024-02-01 20:09:00.000 18       75      0
    # 11 123   2024-02-01 20:35:00.000 12      100      0
    # 12 654   2024-02-01 21:00:00.000 IP       85     85
    

    .by= requires dplyr 1.1.0 or newer; replace it with group_by(..) (and a trailing ungroup()) if you're on an older version.
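
    For reference, a pre-1.1.0 version of the same pipeline with explicit grouping might look like this (same logic, just group_by()/ungroup() instead of .by=):

    library(dplyr)
    quux %>%
      group_by(ID, CODE) %>%
      mutate(
        cum_IP = if (first(CODE) == "IP") {
            sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)]))
          } else 0
      ) %>%
      ungroup()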

    data.table

    (Since you tagged data.table.)

    library(data.table)
    as.data.table(quux) |>
      _[, cum_IP := if (CODE == "IP") sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)])) else 0,
        by = .(ID, CODE)]
    

    We can use CODE instead of first(CODE) because in data.table, when grouping by a variable, the j expression sees each grouping variable as a length-1 value (the value for the current group).
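
    A small toy example (made-up data, just to illustrate that behavior):

    library(data.table)
    dt <- data.table(g = c("a", "a", "b"), x = 1:3)
    # inside j, the grouping column `g` is the single value for the current group,
    # while non-grouping columns keep the full per-group length
    dt[, .(len_g = length(g), len_x = length(x)), by = g]
    # len_g is 1 in every group; len_x is the group size (2 for "a", 1 for "b")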

    The use of |> _[..] requires R 4.3 or newer. Other options exist for piping a data.table through this step on older versions.
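
    For instance, either of the sketches below avoids the underscore placeholder: assign the data.table once and index it directly, or pipe with magrittr's dot placeholder.

    library(data.table)

    # Option 1: convert once, then use data.table's [ directly
    DT <- as.data.table(quux)
    DT[, cum_IP := if (CODE == "IP") sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)])) else 0,
       by = .(ID, CODE)]

    # Option 2: magrittr pipe, using the dot as the data.table being indexed
    library(magrittr)
    as.data.table(quux) %>%
      .[, cum_IP := if (CODE == "IP") sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)])) else 0,
        by = .(ID, CODE)]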