I have a time-series dataset that uses totalized values to track gas flow in 15-minute intervals. Example dataframe (df) below.
Date Time engine_totalizer
12/25/2021 9:30:00 187304950
12/25/2021 9:15:00 187304854
12/25/2021 9:00:00 187304854
12/25/2021 8:45:00 187304854
12/25/2021 8:30:00 187304854
12/25/2021 8:15:00 187304854
12/25/2021 8:00:00 187304854
12/25/2021 7:45:00 187304854
12/25/2021 7:30:00 187304854
12/25/2021 7:15:00 187304854
12/25/2021 7:00:00 187304854
Periodically, the totalizer value will get "stuck" for a period of time. See the 'engine_totalizer' column above from 7:00:00 - 9:15:00.
I want to be able to track these gaps where the totalizer is "stuck" using a running counter variable ('gap_length' in the following example dataframe), that would be zero if the totalizer changes from one timestamp to another, but would count up from zero for every consecutive timestamp that the totalizer does not change from one timestamp to another.
Date Time engine_totalizer gap_length
12/25/2021 9:30:00 187304950 0
12/25/2021 9:15:00 187304854 10
12/25/2021 9:00:00 187304854 9
12/25/2021 8:45:00 187304854 8
12/25/2021 8:30:00 187304854 7
12/25/2021 8:15:00 187304854 6
12/25/2021 8:00:00 187304854 5
12/25/2021 7:45:00 187304854 4
12/25/2021 7:30:00 187304854 3
12/25/2021 7:15:00 187304854 2
12/25/2021 7:00:00 187304854 1
12/25/2021 6:45:00 187304700 0
I have tried to accomplish this using dplyr and case_when:
# Initialize gap_length
df$gap_length <- 0
df<-df%>%
mutate(gap_length = case_when(engine_totalizer == lead(engine_totalizer) ~ lead(gap_length) + 1,
TRUE ~ 0))
But when I use the above code, gap_length just returns zero for every timestamp.
I'm not sure if I am using the lead() function incorrectly, or if there is a more efficient way to accomplish this.
Thanks!
To me, it seems like mutate() is having trouble using the updated gap value in the proceeding row (i.e., it seems to think lead(gap) is still 0 no matter what). Not an expert on this but this is a solution I came up with (note: you have to sort your data by time descending -- so earliest to latest -- for this to work how you want):
#sort data first
df <- df %>% arrange(desc(date))
new_vect <- vector()
for (i in 1:nrow(df)) {
if (i != 1){
if(df$totalizer[i] == df$totalizer[i-1]) {
new_vect <- append(new_vect, new_vect[i-1] + 1)
} else {
new_vect <- append(new_vect, 0)
}
}else {
new_vect <- append(new_vect, 0)
}
}
df$gap <- new_vect
#can always re-arrange the data after
df <- df %>% arrange(date)