I have a dataset like:
ID DATETIME CODE Value
999 1/2/2024 16:22 TX 100
123 1/2/2024 16:47 IP 100
666 1/2/2024 17:13 IP 85
666 1/2/2024 17:38 IP 100
123 1/2/2024 18:03 TX 90
666 1/2/2024 18:28 TX 85
666 1/2/2024 18:54 IP 100
123 1/2/2024 19:19 CA 100
666 1/2/2024 19:44 OX 95
999 1/2/2024 20:09 18 75
123 1/2/2024 20:35 12 100
654 1/2/2024 21:00 IP 85
Here's a reprex of the above:
structure(list(ID = c("999", "123", "666", "666", "123", "666",
"666", "123", "666", "999", "123", "654"), DATETIME = structure(c(1706804520,
1706806020, 1706807580, 1706809080, 1706810580, 1706812080, 1706813640,
1706815140, 1706816640, 1706818140, 1706819700, 1706821200), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), CODE = c("TX", "IP", "IP", "IP", "TX",
"TX", "IP", "CA", "OX", "18", "12", "IP"), Value = c(100, 100,
85, 100, 90, 85, 100, 100, 95, 75, 100, 85)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -12L))
I'd like to add a column with the cumulative sum of Value, by ID, over the last 2 hours, counting only rows that meet a condition (CODE == "IP"). Like this:
ID DATETIME CODE Value cum_IP
999 1/2/2024 16:22 TX 100 0
123 1/2/2024 16:47 IP 100 100
666 1/2/2024 17:13 IP 85 85
666 1/2/2024 17:38 IP 100 185
123 1/2/2024 18:03 TX 90 0
666 1/2/2024 18:28 TX 85 0
666 1/2/2024 18:54 IP 100 285
123 1/2/2024 19:19 CA 100 0
666 1/2/2024 19:44 OX 95 0
999 1/2/2024 20:09 18 75 0
123 1/2/2024 20:35 12 100 0
654 1/2/2024 21:00 IP 85 85
I hope I haven't made any mistakes computing the column by hand, but the idea should be clear: a function that computes an aggregate (a sum or a simple count) over rows, given a grouping variable (ID), a predicate (CODE == "IP" in this case, but it could be, e.g., is.number(CODE)) and a window (2 hours back from the current row).
Try the below, with your data stored as quux. I'm choosing to group by CODE as well to reduce the conditioning inside the code. We use sapply as a simple rolling window over the previous 2 hours (7200 seconds). I'm using a single if conditional per group so that we don't sapply(.) through the data unnecessarily.
library(dplyr)
quux %>%
  mutate(
    cum_IP = if (first(CODE) == "IP") {
      sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)]))
    } else 0,
    .by = c(ID, CODE))
# # A tibble: 12 × 5
# ID DATETIME CODE Value cum_IP
# <chr> <dttm> <chr> <dbl> <dbl>
# 1 999 2024-02-01 16:22:00.000 TX 100 0
# 2 123 2024-02-01 16:47:00.000 IP 100 100
# 3 666 2024-02-01 17:13:00.000 IP 85 85
# 4 666 2024-02-01 17:38:00.000 IP 100 185
# 5 123 2024-02-01 18:03:00.000 TX 90 0
# 6 666 2024-02-01 18:28:00.000 TX 85 0
# 7 666 2024-02-01 18:54:00.000 IP 100 285
# 8 123 2024-02-01 19:19:00.000 CA 100 0
# 9 666 2024-02-01 19:44:00.000 OX 95 0
# 10 999 2024-02-01 20:09:00.000 18 75 0
# 11 123 2024-02-01 20:35:00.000 12 100 0
# 12 654 2024-02-01 21:00:00.000 IP 85 85
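To make the window step concrete, here is a minimal base-R sketch for a single group, the three ID 666 / CODE "IP" rows from the question. The names tm, val and win are only illustrative, and the explicit >= / <= comparison stands in for between(); it is meant as an illustration of the idea, not a replacement for the code above.

```r
# Timestamps and values of the ID 666 / CODE "IP" rows (taken from the question's data)
tm  <- as.POSIXct(c("2024-02-01 17:13:00", "2024-02-01 17:38:00",
                    "2024-02-01 18:54:00"), tz = "UTC")
val <- c(85, 100, 100)

# For each row, sum the values whose timestamp falls in [time - 7200 s, time]
win <- sapply(seq_along(tm), function(i) {
  sum(val[tm >= tm[i] - 7200 & tm <= tm[i]])
})
win
# [1]  85 185 285
```

All three rows fall within 2 hours of 18:54, which is why the last value is 285 rather than 100.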
.by= requires dplyr_1.1.0 or newer; replace it with group_by(..) if you are on an older version.
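For reference, the group_by(..) form might look like the sketch below. It rebuilds the question's data as a plain data.frame named quux (the name used in this answer) and stores the result in res, a name chosen only for this example. The ungroup() at the end is an addition so the result is not left grouped.

```r
library(dplyr)

# The question's data, rebuilt from its dput() output
quux <- data.frame(
  ID = c("999", "123", "666", "666", "123", "666",
         "666", "123", "666", "999", "123", "654"),
  DATETIME = as.POSIXct(
    c(1706804520, 1706806020, 1706807580, 1706809080, 1706810580, 1706812080,
      1706813640, 1706815140, 1706816640, 1706818140, 1706819700, 1706821200),
    origin = "1970-01-01", tz = "UTC"),
  CODE = c("TX", "IP", "IP", "IP", "TX", "TX", "IP", "CA", "OX", "18", "12", "IP"),
  Value = c(100, 100, 85, 100, 90, 85, 100, 100, 95, 75, 100, 85),
  stringsAsFactors = FALSE
)

# Same logic as the .by version, with explicit grouping
res <- quux %>%
  group_by(ID, CODE) %>%
  mutate(
    cum_IP = if (first(CODE) == "IP") {
      sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm - 7200, tm)]))
    } else 0
  ) %>%
  ungroup()
res
```

The cum_IP column should match the output shown above, since only the way the grouping is declared has changed.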
(Since you tagged data.table.)
library(data.table)
as.data.table(quux) |>
  _[, cum_IP := if (CODE == "IP") sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm-7200, tm)])) else 0,
    by = .(ID, CODE)]
We can do CODE instead of first(CODE) because in data.table, when grouping by a variable, the inner expression only sees a length-1 value for each grouping variable.
The use of |> _[..] requires R-4.3 or newer. Other options exist for pipelining data.table operations if you are on an older version.
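One such option, sketched here, is to skip the pipe entirely: convert once, store the result, and update it by reference with :=. The object name DT is just a choice for this example, and quux is rebuilt from the question's dput() output so the snippet runs on its own.

```r
library(data.table)

# The question's data, rebuilt from its dput() output
quux <- data.frame(
  ID = c("999", "123", "666", "666", "123", "666",
         "666", "123", "666", "999", "123", "654"),
  DATETIME = as.POSIXct(
    c(1706804520, 1706806020, 1706807580, 1706809080, 1706810580, 1706812080,
      1706813640, 1706815140, 1706816640, 1706818140, 1706819700, 1706821200),
    origin = "1970-01-01", tz = "UTC"),
  CODE = c("TX", "IP", "IP", "IP", "TX", "TX", "IP", "CA", "OX", "18", "12", "IP"),
  Value = c(100, 100, 85, 100, 90, 85, 100, 100, 95, 75, 100, 85),
  stringsAsFactors = FALSE
)

# Convert once, then add the column by reference; no pipe or placeholder needed
DT <- as.data.table(quux)
DT[, cum_IP := if (CODE == "IP") {
       sapply(DATETIME, function(tm) sum(Value[between(DATETIME, tm - 7200, tm)]))
     } else 0,
   by = .(ID, CODE)]
DT
```

Because := modifies DT in place, there is nothing to reassign; printing DT afterwards shows the new column.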