Search code examples
rtime-seriesdata.tablecountingduration

Calculate durations of states and number of state changes from a time series


I have a data frame of two columns, one column (Time) contains timestamps from images, the other column (State) contains state shown on the image. These data frames can span several weeks, and have hundreds of rows per day. The "state" in consecutive rows can change, but doesn't have to change.

Example data:

Time State
20220526173731 PORED
20220526173741 PORED
20220526173746 SJEDI
20220526175242 PORED
20220526175246 SJEDI
20220526175806 SJEDI
20220526175810 SJEDI
20220526175818 NEMA
20220526175819 SJEDI
20220526175819 SJEDI
20220526175822 SJEDI
20220526180013 SJEDI
20220522071053 NEMA
20220522071056 NEMA

What I'd like to do is use R to calculate the duration (in seconds) of each state for the entire period, the duration (in seconds) of each state for each day and the duration (in seconds) of each state before it changes to the next state. Also, I'd like to calculate the number of times the state changes for the entire period and for each day.

What I did so far:

library(dplyr)
library(data.table)

# input dhe data frame
dat <- read.csv("Lala.csv")


dat$Time <- as.character(dat$Time)
dat$Time <- as.POSIXct(dat$Time, format="%Y%m%d%H%M%S", tz="CET")
dat$Time

dat$State <- as.factor(dat$State)

# give each state run a unique number
setDT(dat)
dat[, state_run := cumsum(c(TRUE, diff(as.integer(dat$State)) != 0L))]
head(dat,20)

# calculate the duration of each state 
dat2 <- dat[, list(StartTime = min(Time),
                     State = State[1],
                     Duration = diff(range(Time))), by = state_run]

Unfortunately, this does not give me the duration of each state, it gives me duration from the first row of each state to the last row of each state, which is not the end of the state. The end of the state if the first row of the next state.

Thank you for any help!

Updating with expected output for the example data:

State duration per day

Day State Duration
2022-05-22 NEMA 60547
2022-05-22 PORED 0
2022-05-22 SJEDI 0
2022-05-26 NEMA 63452
2022-05-26 PORED 19
2022-05-26 SJEDI 1342

Total state durations

State Duration
NEMA 123999
PORED 19
SJEDI 1342

Number of state changes per day

Day # of State changes
2022-05-22 0
2022-05-26 6

Solution

  • This seems mostly like simple summaries in data.table.

    1. I'll convert your Time to a POSIXt object so that we can rely on time differencing (where some are multi-days ... and while not present here, if anything spans to a different month, your math without this will be wrong).
    2. Add Day.
    3. Order by Time, so that we can calculate diff, the number of seconds between each row and the next. (The last row is an unknown duration, over to you for the context of how to know how long the last row lasts.)
    library(data.table)
    setDT(quux)
    quux[, Time := as.POSIXct(as.character(Time), format="%Y%m%d%H%M%S")
         ][, Day := as.Date(Time)]
    #                    Time  State        Day
    #                  <POSc> <char>     <Date>
    #  1: 2022-05-26 17:37:31  PORED 2022-05-26
    #  2: 2022-05-26 17:37:41  PORED 2022-05-26
    #  3: 2022-05-26 17:37:46  SJEDI 2022-05-26
    #  4: 2022-05-26 17:52:42  PORED 2022-05-26
    #  5: 2022-05-26 17:52:46  SJEDI 2022-05-26
    #  6: 2022-05-26 17:58:06  SJEDI 2022-05-26
    #  7: 2022-05-26 17:58:10  SJEDI 2022-05-26
    #  8: 2022-05-26 17:58:18   NEMA 2022-05-26
    #  9: 2022-05-26 17:58:19  SJEDI 2022-05-26
    # 10: 2022-05-26 17:58:19  SJEDI 2022-05-26
    # 11: 2022-05-26 17:58:22  SJEDI 2022-05-26
    # 12: 2022-05-26 18:00:13  SJEDI 2022-05-26
    # 13: 2022-05-22 07:10:53   NEMA 2022-05-22
    # 14: 2022-05-22 07:10:56   NEMA 2022-05-22
    setorder(quux, Time)
    quux[, diff := c(as.numeric(diff(Time), units = "secs"), NA)]
    #                    Time  State        Day   diff
    #                  <POSc> <char>     <Date>  <num>
    #  1: 2022-05-22 07:10:53   NEMA 2022-05-22      3
    #  2: 2022-05-22 07:10:56   NEMA 2022-05-22 383195
    #  3: 2022-05-26 17:37:31  PORED 2022-05-26     10
    #  4: 2022-05-26 17:37:41  PORED 2022-05-26      5
    #  5: 2022-05-26 17:37:46  SJEDI 2022-05-26    896
    #  6: 2022-05-26 17:52:42  PORED 2022-05-26      4
    #  7: 2022-05-26 17:52:46  SJEDI 2022-05-26    320
    #  8: 2022-05-26 17:58:06  SJEDI 2022-05-26      4
    #  9: 2022-05-26 17:58:10  SJEDI 2022-05-26      8
    # 10: 2022-05-26 17:58:18   NEMA 2022-05-26      1
    # 11: 2022-05-26 17:58:19  SJEDI 2022-05-26      0
    # 12: 2022-05-26 17:58:19  SJEDI 2022-05-26      3
    # 13: 2022-05-26 17:58:22  SJEDI 2022-05-26    111
    # 14: 2022-05-26 18:00:13  SJEDI 2022-05-26     NA
    

    Your summaries:

    quux[, sum(diff, na.rm = TRUE), by = .(State)]
    #     State     V1
    #    <char>  <num>
    # 1:   NEMA 383199
    # 2:  PORED     19
    # 3:  SJEDI   1342
    quux[, sum(diff, na.rm = TRUE), by = .(State, Day)]
    #     State        Day     V1
    #    <char>     <Date>  <num>
    # 1:   NEMA 2022-05-22 383198
    # 2:  PORED 2022-05-26     19
    # 3:  SJEDI 2022-05-26   1342
    # 4:   NEMA 2022-05-26      1
    quux[, sum(diff, na.rm = TRUE), by = .(rleid(State), State)]
    #    rleid  State     V1
    #    <int> <char>  <num>
    # 1:     1   NEMA 383198
    # 2:     2  PORED     15
    # 3:     3  SJEDI    896
    # 4:     4  PORED      4
    # 5:     5  SJEDI    332
    # 6:     6   NEMA      1
    # 7:     7  SJEDI    114
    quux[, sum(State[-1] != State[-.N]), by = Day]
    #           Day    V1
    #        <Date> <int>
    # 1: 2022-05-22     0
    # 2: 2022-05-26     5
    

    Starting data:

    quux <- structure(list(Time = c(20220526173731, 20220526173741, 20220526173746, 20220526175242, 20220526175246, 20220526175806, 20220526175810, 20220526175818, 20220526175819, 20220526175819, 20220526175822, 20220526180013, 20220522071053, 20220522071056), State = c("PORED", "PORED", "SJEDI", "PORED", "SJEDI", "SJEDI", "SJEDI", "NEMA", "SJEDI", "SJEDI", "SJEDI", "SJEDI", "NEMA", "NEMA")), class = "data.frame", row.names = c(NA, -14L))