r time-series data.table counting duration

Calculate durations of states and number of state changes from a time series

I have a data frame of two columns, one column (Time) contains timestamps from images, the other column (State) contains state shown on the image. These data frames can span several weeks, and have hundreds of rows per day. The "state" in consecutive rows can change, but doesn't have to change.

Example data:

Time	State
20220526173731	PORED
20220526173741	PORED
20220526173746	SJEDI
20220526175242	PORED
20220526175246	SJEDI
20220526175806	SJEDI
20220526175810	SJEDI
20220526175818	NEMA
20220526175819	SJEDI
20220526175819	SJEDI
20220526175822	SJEDI
20220526180013	SJEDI
20220522071053	NEMA
20220522071056	NEMA

What I'd like to do is use R to calculate the duration (in seconds) of each state for the entire period, the duration (in seconds) of each state for each day and the duration (in seconds) of each state before it changes to the next state. Also, I'd like to calculate the number of times the state changes for the entire period and for each day.

What I did so far:

library(dplyr)
library(data.table)

# input dhe data frame
dat <- read.csv("Lala.csv")


dat$Time <- as.character(dat$Time)
dat$Time <- as.POSIXct(dat$Time, format="%Y%m%d%H%M%S", tz="CET")
dat$Time

dat$State <- as.factor(dat$State)

# give each state run a unique number
setDT(dat)
dat[, state_run := cumsum(c(TRUE, diff(as.integer(dat$State)) != 0L))]
head(dat,20)

# calculate the duration of each state 
dat2 <- dat[, list(StartTime = min(Time),
                     State = State[1],
                     Duration = diff(range(Time))), by = state_run]

Unfortunately, this does not give me the duration of each state, it gives me duration from the first row of each state to the last row of each state, which is not the end of the state. The end of the state if the first row of the next state.

Thank you for any help!

Updating with expected output for the example data:

State duration per day

Day	State	Duration
2022-05-22	NEMA	60547
2022-05-22	PORED	0
2022-05-22	SJEDI	0
2022-05-26	NEMA	63452
2022-05-26	PORED	19
2022-05-26	SJEDI	1342

Total state durations

State	Duration
NEMA	123999
PORED	19
SJEDI	1342

Number of state changes per day

Day	# of State changes
2022-05-22	0
2022-05-26	6

Solution

This seems mostly like simple summaries in data.table.

I'll convert your Time to a POSIXt object so that we can rely on time differencing (where some are multi-days ... and while not present here, if anything spans to a different month, your math without this will be wrong).
Add Day.
Order by Time, so that we can calculate diff, the number of seconds between each row and the next. (The last row is an unknown duration, over to you for the context of how to know how long the last row lasts.)

library(data.table)
setDT(quux)
quux[, Time := as.POSIXct(as.character(Time), format="%Y%m%d%H%M%S")
     ][, Day := as.Date(Time)]
#                    Time  State        Day
#                  <POSc> <char>     <Date>
#  1: 2022-05-26 17:37:31  PORED 2022-05-26
#  2: 2022-05-26 17:37:41  PORED 2022-05-26
#  3: 2022-05-26 17:37:46  SJEDI 2022-05-26
#  4: 2022-05-26 17:52:42  PORED 2022-05-26
#  5: 2022-05-26 17:52:46  SJEDI 2022-05-26
#  6: 2022-05-26 17:58:06  SJEDI 2022-05-26
#  7: 2022-05-26 17:58:10  SJEDI 2022-05-26
#  8: 2022-05-26 17:58:18   NEMA 2022-05-26
#  9: 2022-05-26 17:58:19  SJEDI 2022-05-26
# 10: 2022-05-26 17:58:19  SJEDI 2022-05-26
# 11: 2022-05-26 17:58:22  SJEDI 2022-05-26
# 12: 2022-05-26 18:00:13  SJEDI 2022-05-26
# 13: 2022-05-22 07:10:53   NEMA 2022-05-22
# 14: 2022-05-22 07:10:56   NEMA 2022-05-22
setorder(quux, Time)
quux[, diff := c(as.numeric(diff(Time), units = "secs"), NA)]
#                    Time  State        Day   diff
#                  <POSc> <char>     <Date>  <num>
#  1: 2022-05-22 07:10:53   NEMA 2022-05-22      3
#  2: 2022-05-22 07:10:56   NEMA 2022-05-22 383195
#  3: 2022-05-26 17:37:31  PORED 2022-05-26     10
#  4: 2022-05-26 17:37:41  PORED 2022-05-26      5
#  5: 2022-05-26 17:37:46  SJEDI 2022-05-26    896
#  6: 2022-05-26 17:52:42  PORED 2022-05-26      4
#  7: 2022-05-26 17:52:46  SJEDI 2022-05-26    320
#  8: 2022-05-26 17:58:06  SJEDI 2022-05-26      4
#  9: 2022-05-26 17:58:10  SJEDI 2022-05-26      8
# 10: 2022-05-26 17:58:18   NEMA 2022-05-26      1
# 11: 2022-05-26 17:58:19  SJEDI 2022-05-26      0
# 12: 2022-05-26 17:58:19  SJEDI 2022-05-26      3
# 13: 2022-05-26 17:58:22  SJEDI 2022-05-26    111
# 14: 2022-05-26 18:00:13  SJEDI 2022-05-26     NA

Your summaries:

quux[, sum(diff, na.rm = TRUE), by = .(State)]
#     State     V1
#    <char>  <num>
# 1:   NEMA 383199
# 2:  PORED     19
# 3:  SJEDI   1342
quux[, sum(diff, na.rm = TRUE), by = .(State, Day)]
#     State        Day     V1
#    <char>     <Date>  <num>
# 1:   NEMA 2022-05-22 383198
# 2:  PORED 2022-05-26     19
# 3:  SJEDI 2022-05-26   1342
# 4:   NEMA 2022-05-26      1
quux[, sum(diff, na.rm = TRUE), by = .(rleid(State), State)]
#    rleid  State     V1
#    <int> <char>  <num>
# 1:     1   NEMA 383198
# 2:     2  PORED     15
# 3:     3  SJEDI    896
# 4:     4  PORED      4
# 5:     5  SJEDI    332
# 6:     6   NEMA      1
# 7:     7  SJEDI    114
quux[, sum(State[-1] != State[-.N]), by = Day]
#           Day    V1
#        <Date> <int>
# 1: 2022-05-22     0
# 2: 2022-05-26     5

Starting data:

quux <- structure(list(Time = c(20220526173731, 20220526173741, 20220526173746, 20220526175242, 20220526175246, 20220526175806, 20220526175810, 20220526175818, 20220526175819, 20220526175819, 20220526175822, 20220526180013, 20220522071053, 20220522071056), State = c("PORED", "PORED", "SJEDI", "PORED", "SJEDI", "SJEDI", "SJEDI", "NEMA", "SJEDI", "SJEDI", "SJEDI", "SJEDI", "NEMA", "NEMA")), class = "data.frame", row.names = c(NA, -14L))