I have a tabular log file with the (simplified) structure: <time>, <event_tag>
and want to find intervals between two different event_tags
, problem
and all fine
. The difficulty is that all fine
and also problem
is repeated regularly.
So the algorithm by hand would be to find the first problem
and look for the next all fine
and proceed until the last problem
and subsequent all fine
.
Sample dataset:
library(data.table)
set.seed(156125)
DT <- data.table(time = seq(as.POSIXct(tz = "UTC", "2024-01-01"),
as.POSIXct(tz = "UTC", "2024-01-10"),
by = "12 hours"),
event_tag = c("problem", "all fine")[round(runif(19, 1.2, 2.49))])
# time event_tag
# <POSc> <char>
# 1: 2024-01-01 00:00:00 all fine
# 2: 2024-01-01 12:00:00 all fine
# 3: 2024-01-02 00:00:00 problem
# 4: 2024-01-02 12:00:00 all fine
# 5: 2024-01-03 00:00:00 all fine
# 6: 2024-01-03 12:00:00 problem
# 7: 2024-01-04 00:00:00 problem
# 8: 2024-01-04 12:00:00 all fine
# 9: 2024-01-05 00:00:00 all fine
# 10: 2024-01-05 12:00:00 all fine
# 11: 2024-01-06 00:00:00 all fine
# 12: 2024-01-06 12:00:00 all fine
# 13: 2024-01-07 00:00:00 problem
# 14: 2024-01-07 12:00:00 all fine
# 15: 2024-01-08 00:00:00 all fine
# 16: 2024-01-08 12:00:00 problem
# 17: 2024-01-09 00:00:00 all fine
# 18: 2024-01-09 12:00:00 all fine
# 19: 2024-01-10 00:00:00 all fine
Desired result:
data.table(problem_start = DT$time[c(3, 6, 13, 16)],
problem_end = DT$time[c(4, 8, 14, 17)])
# problem_start problem_end
# <POSc> <POSc>
# 1: 2024-01-02 00:00:00 2024-01-02 12:00:00
# 2: 2024-01-03 12:00:00 2024-01-04 12:00:00
# 3: 2024-01-07 00:00:00 2024-01-07 12:00:00
# 4: 2024-01-08 12:00:00 2024-01-09 00:00:00
I thought about some solution by making the two tags boolean
and use cumsum
, but could not figure it out completely. And maybe there is a neat data.table
-way to do it which I don't see at the moment. However, I would also be happy with a dplyr
solution even though I prefer data.table
.
DT[ , bool := ifelse(event_tag == "all fine", 0, 1)]
DT[ , cumsum(bool)]
Here is a different data.table
approach (although I prefer the rolling join solution)
DT[DT[, .I[1], by = rleid(event_tag)]$V1
][, end_time := shift(time, type = "lead")
][event_tag == "problem", ]
Functionality:
rleid()
), select the first row of each group (.I[1]
)shift()
)NB: requires the original data to be sorted (or keyed) on time