r, dplyr, lubridate

How to pull first and last date from a dataframe with conditions and breaks?


I have water intensity data that was collected every hour over many months (a subset is below). I need the first date-time at which intensity rose above 900, but only if it then stayed above 900 for at least 24 hours. I also need the date-time at which intensity subsequently dropped back below 900 at the end of that period. I then need to repeat this for every later period of 24 or more hours during which intensity is above 900, throughout the study period. I'm hoping to avoid going through all the data by hand for every site.

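## creates example dataframe
library(lubridate)  # provides ymd_hms() and hours()

## number of hourly steps after the first timestamp (114, giving 115 rows)
NoOfHours <- as.numeric(difftime(ymd_hms("2010-01-06 01:00:00"), ymd_hms("2010-01-01 07:00:00"), units = "hours"))
dt <- ymd_hms("2010-01-01 00:00:00") + hours(0:NoOfHours)
intensity <- c(rep(0, 23), rep(c(901, 904), 12), rep(c(660, 540), 10), rep(c(905, 3000), 10), 550, rep(c(1000, 1200), 13), 340)

df <- data.frame(dt, intensity)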

So I should end up with:

                  dt1 status
1 2010-01-01 23:00:00  start
2 2010-01-02 23:00:00   stop
3 2010-01-04 16:00:00  start
4 2010-01-05 18:00:00   stop

Solution

  • Keep only the rows with intensity greater than 900, then create a grouping variable g with seqid from the collapse package, which assigns a unique id to each run of consecutive timestamps (del = 3600 is the step between consecutive rows, in seconds). Reduce each group to a single row holding the first date-time, the last date-time plus one hour, and the number of rows n. Keep only the groups that represent at least 24 original rows. Finally, convert to long form so that start and stop are on separate rows, and select the desired columns. Note that the .by argument of summarize requires dplyr 1.1.0 or later.

    library(collapse)
    library(dplyr)
    library(tidyr)
    
    df %>%
      filter(intensity > 900) %>%
      mutate(g = seqid(dt, del = 3600)) %>%
      summarize(start = first(dt), stop = last(dt) + 3600, n = n(), .by = g) %>%
      filter(n >= 24) %>%
      pivot_longer(start:stop, names_to = "status", values_to = "dt1") %>%
      select(dt1, status)
    

    giving

    # A tibble: 4 × 2
      dt1                 status
      <dttm>              <chr> 
    1 2010-01-01 23:00:00 start 
    2 2010-01-02 23:00:00 stop  
    3 2010-01-04 16:00:00 start 
    4 2010-01-05 18:00:00 stop
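
For comparison, here is a minimal base R sketch of the same idea using rle, which is a different technique from the seqid approach above and not part of the original answer. It assumes the rows are sorted by time, are exactly one hour apart with no gaps, and contain no missing intensity values. The names runs, run_start, run_end, keep and events are illustrative only.

```r
# Rebuild the example data with base R only (no lubridate needed).
dt <- as.POSIXct("2010-01-01 00:00:00", tz = "UTC") + 3600 * (0:114)
intensity <- c(rep(0, 23), rep(c(901, 904), 12), rep(c(660, 540), 10),
               rep(c(905, 3000), 10), 550, rep(c(1000, 1200), 13), 340)
df <- data.frame(dt, intensity)

# Run-length encode the "above 900" indicator.
# Assumption: hourly rows, sorted, no gaps, no NA values.
runs <- rle(df$intensity > 900)
run_end <- cumsum(runs$lengths)            # last row index of each run
run_start <- run_end - runs$lengths + 1    # first row index of each run

# Keep runs that are above 900 and span at least 24 rows (24 hours).
keep <- runs$values & runs$lengths >= 24

# "stop" is one hour after the last row of the run, which also works
# when the data ends while intensity is still above 900.
events <- data.frame(
  dt1 = c(df$dt[run_start[keep]], df$dt[run_end[keep]] + 3600),
  status = rep(c("start", "stop"), each = sum(keep)),
  stringsAsFactors = FALSE
)
events <- events[order(events$dt1), ]
rownames(events) <- NULL
events
```

Unlike seqid(dt, del = 3600), rle looks only at row order, so a missing hour in the data would not split a run here. If several sites were stored in one data frame, the same steps could probably be wrapped in a function and applied per site, for example with split() and lapply(), assuming a site column exists.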