Search code examples
rdplyrtime-series

R: Create a running count variable calculated based on previous row's value, that resets based on condition


I have a time-series dataset that uses totalized values to track gas flow in 15-minute intervals. Example dataframe (df) below.

 Date          Time       engine_totalizer
   12/25/2021  9:30:00        187304950
   12/25/2021  9:15:00        187304854
   12/25/2021  9:00:00        187304854
   12/25/2021  8:45:00        187304854
   12/25/2021  8:30:00        187304854
   12/25/2021  8:15:00        187304854
   12/25/2021  8:00:00        187304854
   12/25/2021  7:45:00        187304854
   12/25/2021  7:30:00        187304854
   12/25/2021  7:15:00        187304854
   12/25/2021  7:00:00        187304854

Periodically, the totalizer value will get "stuck" for a period of time. See the 'engine_totalizer' column above from 7:00:00 - 9:15:00.

I want to be able to track these gaps where the totalizer is "stuck" using a running counter variable ('gap_length' in the following example dataframe), that would be zero if the totalizer changes from one timestamp to another, but would count up from zero for every consecutive timestamp that the totalizer does not change from one timestamp to another.

Date          Time       engine_totalizer   gap_length
   12/25/2021  9:30:00        187304950        0
   12/25/2021  9:15:00        187304854        10      
   12/25/2021  9:00:00        187304854        9
   12/25/2021  8:45:00        187304854        8
   12/25/2021  8:30:00        187304854        7
   12/25/2021  8:15:00        187304854        6
   12/25/2021  8:00:00        187304854        5
   12/25/2021  7:45:00        187304854        4
   12/25/2021  7:30:00        187304854        3
   12/25/2021  7:15:00        187304854        2
   12/25/2021  7:00:00        187304854        1
   12/25/2021  6:45:00        187304700        0

I have tried to accomplish this using dplyr and case_when:

# Initialize gap_length 
df$gap_length <- 0
    df<-df%>%
      mutate(gap_length = case_when(engine_totalizer == lead(engine_totalizer) ~ lead(gap_length) + 1,
                             TRUE ~ 0))

But when I use the above code, gap_length just returns zero for every timestamp.

I'm not sure if I am using the lead() function incorrectly, or if there is a more efficient way to accomplish this.

Thanks!


Solution

  • To me, it seems like mutate() is having trouble using the updated gap value in the proceeding row (i.e., it seems to think lead(gap) is still 0 no matter what). Not an expert on this but this is a solution I came up with (note: you have to sort your data by time descending -- so earliest to latest -- for this to work how you want):

    #sort data first
    df <- df %>% arrange(desc(date))
    
    new_vect <- vector()
    
    for (i in 1:nrow(df)) {
      if (i != 1){
        if(df$totalizer[i] == df$totalizer[i-1]) {
          new_vect <- append(new_vect, new_vect[i-1] + 1)
        } else {
          new_vect <- append(new_vect, 0)
        }
      }else {
        new_vect <- append(new_vect, 0)
      }
    }
    
    df$gap <- new_vect
    
    #can always re-arrange the data after
    df <- df %>% arrange(date)