Search code examples
rdplyrdata.table

In R data.table find time intervals between two different tags in time, tag- data


I have a tabular log file with the (simplified) structure: <time>, <event_tag> and want to find intervals between two different event_tags, problem and all fine. The difficulty is that all fine and also problem is repeated regularly. So the algorithm by hand would be to find the first problem and look for the next all fine and proceed until the last problem and subsequent all fine.

Sample dataset:

library(data.table)
set.seed(156125)
DT <- data.table(time = seq(as.POSIXct(tz = "UTC", "2024-01-01"), 
                            as.POSIXct(tz = "UTC", "2024-01-10"), 
                            by = "12 hours"),
                 event_tag = c("problem", "all fine")[round(runif(19, 1.2, 2.49))])

# time                    event_tag
# <POSc>                     <char>
#  1: 2024-01-01 00:00:00  all fine
#  2: 2024-01-01 12:00:00  all fine
#  3: 2024-01-02 00:00:00   problem
#  4: 2024-01-02 12:00:00  all fine
#  5: 2024-01-03 00:00:00  all fine
#  6: 2024-01-03 12:00:00   problem
#  7: 2024-01-04 00:00:00   problem
#  8: 2024-01-04 12:00:00  all fine
#  9: 2024-01-05 00:00:00  all fine
# 10: 2024-01-05 12:00:00  all fine
# 11: 2024-01-06 00:00:00  all fine
# 12: 2024-01-06 12:00:00  all fine
# 13: 2024-01-07 00:00:00   problem
# 14: 2024-01-07 12:00:00  all fine
# 15: 2024-01-08 00:00:00  all fine
# 16: 2024-01-08 12:00:00   problem
# 17: 2024-01-09 00:00:00  all fine
# 18: 2024-01-09 12:00:00  all fine
# 19: 2024-01-10 00:00:00  all fine

Desired result:

data.table(problem_start = DT$time[c(3, 6, 13, 16)],
           problem_end = DT$time[c(4, 8, 14, 17)])

#          problem_start         problem_end
#                 <POSc>              <POSc>
# 1: 2024-01-02 00:00:00 2024-01-02 12:00:00
# 2: 2024-01-03 12:00:00 2024-01-04 12:00:00
# 3: 2024-01-07 00:00:00 2024-01-07 12:00:00
# 4: 2024-01-08 12:00:00 2024-01-09 00:00:00

I thought about some solution by making the two tags boolean and use cumsum, but could not figure it out completely. And maybe there is a neat data.table-way to do it which I don't see at the moment. However, I would also be happy with a dplyr solution even though I prefer data.table.

DT[ , bool := ifelse(event_tag == "all fine", 0, 1)]
DT[ , cumsum(bool)]

Solution

  • Here is a different data.table approach (although I prefer the rolling join solution)

    DT[DT[, .I[1], by = rleid(event_tag)]$V1
       ][, end_time := shift(time, type = "lead")
         ][event_tag == "problem", ]
    

    Functionality:

    • create groups of consecutive event_tag (rleid()), select the first row of each group (.I[1])
    • endtime is the first row of the next group/row (shift())
    • select only the problem-rows

    NB: requires the original data to be sorted (or keyed) on time