Search code examples
rdatetimecumsum

using cumsum with conditions to reset, two scenarios


I have a dataframe, where I am trying to find two things: 1) the start of an event and, 2) the end of the event. The start of an event is based on a cumulative sum threshold, whereas the end of an event is dependent on there being 5 rows with 0 values between the last row with a value greater than 0, and the current time.

Example data is as follows

# hourly time series
a <- seq(from=as.POSIXct("2012-06-01 0:00", tz="UTC"),
         to=as.POSIXct("2012-09-01 00:00", tz="UTC"),
         by="hour")  

# mock data 
b <- sample.int(10, 2209, replace = TRUE)*sample(c(0,1), replace=TRUE, size=2209)

# mock time series data table
c <- data.table(a,b)
                        a b
   1: 2012-06-01 00:00:00 0
   2: 2012-06-01 01:00:00 0
   3: 2012-06-01 02:00:00 0
   4: 2012-06-01 03:00:00 7
   5: 2012-06-01 04:00:00 0
  ---                      
2205: 2012-08-31 20:00:00 8
2206: 2012-08-31 21:00:00 4
2207: 2012-08-31 22:00:00 2
2208: 2012-08-31 23:00:00 0
2209: 2012-09-01 00:00:00 0
---

I want to identify events within the time series, based on a threshold of a cumulative sum of 10 (in column b). So when a date/time has a cumulative sum of 10 or more, the event starts.

c$cumsum <- with(c, ave(b, cumsum(b == 0), FUN = cumsum))

                        a b cumsum
   1: 2012-06-01 00:00:00 0      0
   2: 2012-06-01 01:00:00 0      0
   3: 2012-06-01 02:00:00 0      0
   4: 2012-06-01 03:00:00 7      7
   5: 2012-06-01 04:00:00 0      0
  ---                             
2205: 2012-08-31 20:00:00 8      8
2206: 2012-08-31 21:00:00 4     12
2207: 2012-08-31 22:00:00 2     14
2208: 2012-08-31 23:00:00 0      0
2209: 2012-09-01 00:00:00 0      0

For example, in the above code, an event would begin at 2012-08-31 21:00:00 due to the cumulative sum of b = 12. Also, although 2012-08-31 22:00:00 has a cumsum of 14, it is not the start of an event, as the event had begun the hour prior to it (based on the condition of event beginning when cumsum => 10).

I also need to find the end of the event, and this is where I'm stuck. The end of the event would occur when 5 hours have passed, without any values (i.e. 5 rows with 0's in column b). Then I would like to create a dataframe, which consists of only events (i.e. date/time of start of an event, with the corresponding date/time of the end of that same event). This would look like (manual, fake example):

# dataframe for event start, and the corresponding cumsum of b
              event_start cumsum_b
   1: 2012-06-01 00:00:00 12
   2: 2012-06-09 11:00:00 11
   3: 2012-06-15 02:00:00 10

# dataframe for event end
              event_end   b
   1: 2012-06-01 00:7:00  0
   2: 2012-06-09 18:00:00 0
   3: 2012-06-15 12:00:00 0


Solution

  • library(tidyverse)
    
    df <- tibble(
      a = seq.Date(from = as.Date('2020-01-01'), length.out = 20, by = "days"),
      b = c(0, 0, 0, 7, 0, 8, 12, 0, 0, 0, 0, 0, 0, 14, 3, 0, 0, 0, 0, 0)
    )
    

    We can find the ends using lag. Then use cumsum and cummax to create the resetting cumulative sum.

    events <-
      df %>%
      mutate(
        is_end = coalesce(b == 0 & lag(b) == 0 & lag(b, 2) == 0 & lag(b, 3) == 0 & lag(b, 4) == 0 & lag(b, 5) != 0, FALSE),
        cumsum = cumsum(b) - cummax(is_end * cumsum(b)),
        is_event = cumsum >= 10,
        start = is_event & !lag(is_event),
        end = !is_event & lag(is_event)
      )
    events
    #> # A tibble: 20 x 7
    #>   a               b  is_end  cumsum is_event start end  
    #>   <date>      <dbl>  <lgl>   <dbl>  <lgl>    <lgl> <lgl>
    #> 1 2020-01-01      0  FALSE   0      FALSE    FALSE NA   
    #> 2 2020-01-02      0  FALSE   0      FALSE    FALSE FALSE
    #> 3 2020-01-03      0  FALSE   0      FALSE    FALSE FALSE
    #> 4 2020-01-04      7  FALSE   7      FALSE    FALSE FALSE
    #> 5 2020-01-05      0  FALSE   7      FALSE    FALSE FALSE
    #> 6 2020-01-06      8  FALSE   15     TRUE     TRUE  FALSE
    #> 7 2020-01-07     12  FALSE   27     TRUE     FALSE FALSE
    #> 8 2020-01-08      0  FALSE   27     TRUE     FALSE FALSE
    #> 9 2020-01-09      0  FALSE   27     TRUE     FALSE FALSE
    #> 10 2020-01-10     0  FALSE   27     TRUE     FALSE FALSE
    #> 11 2020-01-11     0  FALSE   27     TRUE     FALSE FALSE
    #> 12 2020-01-12     0  TRUE    0      FALSE    FALSE TRUE 
    #> 13 2020-01-13     0  FALSE   0      FALSE    FALSE FALSE
    #> 14 2020-01-14    14  FALSE   14     TRUE     TRUE  FALSE
    #> 15 2020-01-15     3  FALSE   17     TRUE     FALSE FALSE
    #> 16 2020-01-16     0  FALSE   17     TRUE     FALSE FALSE
    #> 17 2020-01-17     0  FALSE   17     TRUE     FALSE FALSE
    #> 18 2020-01-18     0  FALSE   17     TRUE     FALSE FALSE
    #> 19 2020-01-19     0  FALSE   17     TRUE     FALSE FALSE
    #> 20 2020-01-20     0  TRUE    0      FALSE    FALSE TRUE 
    

    Then, pull out the start and end dates.

    tibble(
      event_start = events %>% filter(start) %>% pull(a),
      event_end = events %>% filter(end) %>% pull(a)
    )
    #> # A tibble: 2 x 2
    #>   event_start event_end 
    #>   <date>      <date>    
    #> 1 2020-01-06  2020-01-12
    #> 2 2020-01-14  2020-01-20
    

    If you want don't want to manually specify the lags

    find_end <- function(x, n) {
      is_n_consecutive_zeros <-
        map(0:(n-1), ~lag(x, .)) %>%
        pmap_lgl(function(...) all(c(...) == 0))
    
      coalesce(is_n_consecutive_zeros & lag(x, n) != 0, FALSE)
    }
    
    df %>%
      mutate(
        is_end = find_end(b, 5),
        cumsum = cumsum(b) - cummax(is_end * cumsum(b)),
        is_event = cumsum >= 10,
        start = is_event & !lag(is_event),
        end = !is_event & lag(is_event)
      )
    #> # A tibble: 20 x 7
    #>    a              b is_end cumsum is_event start end  
    #>    <date>     <dbl> <lgl>   <dbl> <lgl>    <lgl> <lgl>
    #>  1 2020-01-01     0 FALSE       0 FALSE    FALSE NA   
    #>  2 2020-01-02     0 FALSE       0 FALSE    FALSE FALSE
    #>  3 2020-01-03     0 FALSE       0 FALSE    FALSE FALSE
    #>  4 2020-01-04     7 FALSE       7 FALSE    FALSE FALSE
    #>  5 2020-01-05     0 FALSE       7 FALSE    FALSE FALSE
    #>  6 2020-01-06     8 FALSE      15 TRUE     TRUE  FALSE
    #>  7 2020-01-07    12 FALSE      27 TRUE     FALSE FALSE
    #>  8 2020-01-08     0 FALSE      27 TRUE     FALSE FALSE
    #>  9 2020-01-09     0 FALSE      27 TRUE     FALSE FALSE
    #> 10 2020-01-10     0 FALSE      27 TRUE     FALSE FALSE
    #> 11 2020-01-11     0 FALSE      27 TRUE     FALSE FALSE
    #> 12 2020-01-12     0 TRUE        0 FALSE    FALSE TRUE 
    #> 13 2020-01-13     0 FALSE       0 FALSE    FALSE FALSE
    #> 14 2020-01-14    14 FALSE      14 TRUE     TRUE  FALSE
    #> 15 2020-01-15     3 FALSE      17 TRUE     FALSE FALSE
    #> 16 2020-01-16     0 FALSE      17 TRUE     FALSE FALSE
    #> 17 2020-01-17     0 FALSE      17 TRUE     FALSE FALSE
    #> 18 2020-01-18     0 FALSE      17 TRUE     FALSE FALSE
    #> 19 2020-01-19     0 FALSE      17 TRUE     FALSE FALSE
    #> 20 2020-01-20     0 TRUE        0 FALSE    FALSE TRUE