I have a data frame of two columns, one column (Time) contains timestamps from images, the other column (State) contains state shown on the image. These data frames can span several weeks, and have hundreds of rows per day. The "state" in consecutive rows can change, but doesn't have to change.
Example data:
Time | State |
---|---|
20220526173731 | PORED |
20220526173741 | PORED |
20220526173746 | SJEDI |
20220526175242 | PORED |
20220526175246 | SJEDI |
20220526175806 | SJEDI |
20220526175810 | SJEDI |
20220526175818 | NEMA |
20220526175819 | SJEDI |
20220526175819 | SJEDI |
20220526175822 | SJEDI |
20220526180013 | SJEDI |
20220522071053 | NEMA |
20220522071056 | NEMA |
What I'd like to do is use R to calculate the duration (in seconds) of each state for the entire period, the duration (in seconds) of each state for each day and the duration (in seconds) of each state before it changes to the next state. Also, I'd like to calculate the number of times the state changes for the entire period and for each day.
What I did so far:
library(dplyr)
library(data.table)
# input dhe data frame
dat <- read.csv("Lala.csv")
dat$Time <- as.character(dat$Time)
dat$Time <- as.POSIXct(dat$Time, format="%Y%m%d%H%M%S", tz="CET")
dat$Time
dat$State <- as.factor(dat$State)
# give each state run a unique number
setDT(dat)
dat[, state_run := cumsum(c(TRUE, diff(as.integer(dat$State)) != 0L))]
head(dat,20)
# calculate the duration of each state
dat2 <- dat[, list(StartTime = min(Time),
State = State[1],
Duration = diff(range(Time))), by = state_run]
Unfortunately, this does not give me the duration of each state, it gives me duration from the first row of each state to the last row of each state, which is not the end of the state. The end of the state if the first row of the next state.
Thank you for any help!
Updating with expected output for the example data:
State duration per day
Day | State | Duration |
---|---|---|
2022-05-22 | NEMA | 60547 |
2022-05-22 | PORED | 0 |
2022-05-22 | SJEDI | 0 |
2022-05-26 | NEMA | 63452 |
2022-05-26 | PORED | 19 |
2022-05-26 | SJEDI | 1342 |
Total state durations
State | Duration |
---|---|
NEMA | 123999 |
PORED | 19 |
SJEDI | 1342 |
Number of state changes per day
Day | # of State changes |
---|---|
2022-05-22 | 0 |
2022-05-26 | 6 |
This seems mostly like simple summaries in data.table
.
Time
to a POSIXt
object so that we can rely on time differencing (where some are multi-days ... and while not present here, if anything spans to a different month, your math without this will be wrong).Day
.Time
, so that we can calculate diff
, the number of seconds between each row and the next. (The last row is an unknown duration, over to you for the context of how to know how long the last row lasts.)library(data.table)
setDT(quux)
quux[, Time := as.POSIXct(as.character(Time), format="%Y%m%d%H%M%S")
][, Day := as.Date(Time)]
# Time State Day
# <POSc> <char> <Date>
# 1: 2022-05-26 17:37:31 PORED 2022-05-26
# 2: 2022-05-26 17:37:41 PORED 2022-05-26
# 3: 2022-05-26 17:37:46 SJEDI 2022-05-26
# 4: 2022-05-26 17:52:42 PORED 2022-05-26
# 5: 2022-05-26 17:52:46 SJEDI 2022-05-26
# 6: 2022-05-26 17:58:06 SJEDI 2022-05-26
# 7: 2022-05-26 17:58:10 SJEDI 2022-05-26
# 8: 2022-05-26 17:58:18 NEMA 2022-05-26
# 9: 2022-05-26 17:58:19 SJEDI 2022-05-26
# 10: 2022-05-26 17:58:19 SJEDI 2022-05-26
# 11: 2022-05-26 17:58:22 SJEDI 2022-05-26
# 12: 2022-05-26 18:00:13 SJEDI 2022-05-26
# 13: 2022-05-22 07:10:53 NEMA 2022-05-22
# 14: 2022-05-22 07:10:56 NEMA 2022-05-22
setorder(quux, Time)
quux[, diff := c(as.numeric(diff(Time), units = "secs"), NA)]
# Time State Day diff
# <POSc> <char> <Date> <num>
# 1: 2022-05-22 07:10:53 NEMA 2022-05-22 3
# 2: 2022-05-22 07:10:56 NEMA 2022-05-22 383195
# 3: 2022-05-26 17:37:31 PORED 2022-05-26 10
# 4: 2022-05-26 17:37:41 PORED 2022-05-26 5
# 5: 2022-05-26 17:37:46 SJEDI 2022-05-26 896
# 6: 2022-05-26 17:52:42 PORED 2022-05-26 4
# 7: 2022-05-26 17:52:46 SJEDI 2022-05-26 320
# 8: 2022-05-26 17:58:06 SJEDI 2022-05-26 4
# 9: 2022-05-26 17:58:10 SJEDI 2022-05-26 8
# 10: 2022-05-26 17:58:18 NEMA 2022-05-26 1
# 11: 2022-05-26 17:58:19 SJEDI 2022-05-26 0
# 12: 2022-05-26 17:58:19 SJEDI 2022-05-26 3
# 13: 2022-05-26 17:58:22 SJEDI 2022-05-26 111
# 14: 2022-05-26 18:00:13 SJEDI 2022-05-26 NA
Your summaries:
quux[, sum(diff, na.rm = TRUE), by = .(State)]
# State V1
# <char> <num>
# 1: NEMA 383199
# 2: PORED 19
# 3: SJEDI 1342
quux[, sum(diff, na.rm = TRUE), by = .(State, Day)]
# State Day V1
# <char> <Date> <num>
# 1: NEMA 2022-05-22 383198
# 2: PORED 2022-05-26 19
# 3: SJEDI 2022-05-26 1342
# 4: NEMA 2022-05-26 1
quux[, sum(diff, na.rm = TRUE), by = .(rleid(State), State)]
# rleid State V1
# <int> <char> <num>
# 1: 1 NEMA 383198
# 2: 2 PORED 15
# 3: 3 SJEDI 896
# 4: 4 PORED 4
# 5: 5 SJEDI 332
# 6: 6 NEMA 1
# 7: 7 SJEDI 114
quux[, sum(State[-1] != State[-.N]), by = Day]
# Day V1
# <Date> <int>
# 1: 2022-05-22 0
# 2: 2022-05-26 5
Starting data:
quux <- structure(list(Time = c(20220526173731, 20220526173741, 20220526173746, 20220526175242, 20220526175246, 20220526175806, 20220526175810, 20220526175818, 20220526175819, 20220526175819, 20220526175822, 20220526180013, 20220522071053, 20220522071056), State = c("PORED", "PORED", "SJEDI", "PORED", "SJEDI", "SJEDI", "SJEDI", "NEMA", "SJEDI", "SJEDI", "SJEDI", "SJEDI", "NEMA", "NEMA")), class = "data.frame", row.names = c(NA, -14L))