Search code examples
rdatetimetime-seriesuniqueidentifier

Assign unique ID within a time interval following an event


This is a bit of a curious case for which I have been unable to find a solution on stackoverflow. I have a dataset with a date-time column and a column of values that indicate an event, such as in the dat example below. The date-times are every hour, however, note that occasional "missed" hours exist (2 hours are missing between rows 12 & 13).

dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")), 
                                 max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
                  event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6))
> dat
              datetime event
1  2010-04-03 03:00:00     1
2  2010-04-03 04:00:00    NA
3  2010-04-03 05:00:00    NA
4  2010-04-03 06:00:00    NA
5  2010-04-03 07:00:00    NA
6  2010-04-03 08:00:00    NA
7  2010-04-03 09:00:00    NA
8  2010-04-03 10:00:00    NA
9  2010-04-03 11:00:00    NA
10 2010-04-03 12:00:00    NA
11 2010-04-03 13:00:00     2
12 2010-04-03 14:00:00    NA
13 2010-04-03 17:00:00    NA
14 2010-04-03 18:00:00    NA
15 2010-04-03 19:00:00    NA
16 2010-04-03 20:00:00    NA
17 2010-04-03 21:00:00     3
18 2010-04-03 22:00:00     4
19 2010-04-03 23:00:00    NA
20 2010-04-04 00:00:00    NA
21 2010-04-04 01:00:00    NA
22 2010-04-04 02:00:00    NA
23 2010-04-04 03:00:00    NA
24 2010-04-04 04:00:00    NA
25 2010-04-04 05:00:00    NA
26 2010-04-04 06:00:00    NA
27 2010-04-04 07:00:00    NA
28 2010-04-04 08:00:00     5
29 2010-04-04 09:00:00    NA
30 2010-04-04 10:00:00     6

I would like each row within an interval of 7 hours after the event occurs to be identified with a unique identifier, but with the following caveats (hence the "curious case"):

  • if a subsequent event occurs within the 7 hours of the event prior, that subsequent event is essentially ignored (i.e., "event" number does not equal assigned identifier value), and
  • missing times are accounted for (i.e., the rule is based on the time elapsed, not the number of rows).

The product would look like result:

library(dplyr)

result <- dat %>% 
  mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
> result
              datetime event id
1  2010-04-03 03:00:00     1  1
2  2010-04-03 04:00:00    NA  1
3  2010-04-03 05:00:00    NA  1
4  2010-04-03 06:00:00    NA  1
5  2010-04-03 07:00:00    NA  1
6  2010-04-03 08:00:00    NA  1
7  2010-04-03 09:00:00    NA  1
8  2010-04-03 10:00:00    NA  1
9  2010-04-03 11:00:00    NA NA
10 2010-04-03 12:00:00    NA NA
11 2010-04-03 13:00:00     2  2
12 2010-04-03 14:00:00    NA  2
13 2010-04-03 17:00:00    NA  2
14 2010-04-03 18:00:00    NA  2
15 2010-04-03 19:00:00    NA  2
16 2010-04-03 20:00:00    NA  2
17 2010-04-03 21:00:00     3  3
18 2010-04-03 22:00:00     4  3
19 2010-04-03 23:00:00    NA  3
20 2010-04-04 00:00:00    NA  3
21 2010-04-04 01:00:00    NA  3
22 2010-04-04 02:00:00    NA  3
23 2010-04-04 03:00:00    NA  3
24 2010-04-04 04:00:00    NA  3
25 2010-04-04 05:00:00    NA NA
26 2010-04-04 06:00:00    NA NA
27 2010-04-04 07:00:00    NA NA
28 2010-04-04 08:00:00     5  4
29 2010-04-04 09:00:00    NA  4
30 2010-04-04 10:00:00     6  4

Most ideally, this would be accomplished in a dplyr framework.


Solution

  • library(lubridate)
    library(tidyverse)
    
    dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")), 
                                     max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
                      event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6)) %>% 
      mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
    
    
    Events <- dat %>% 
      #Get only the roes with events
      filter(!is.na(event)) %>% 
      #Get the duration of time between events
      mutate(
        EventLag = datetime - lag(datetime)) %>% 
      ## remove events that occurred < 7 hrs after the previous or that are NA (i.e. the first one). but in the real data
      ## I do not suspect your first point would ever be an event...? Maybe this can be removed in the 
      ## real dataset...
      filter(as.numeric(EventLag) > 7| is.na(EventLag)) %>% 
      as.data.frame()
    
    ## You now have all of the events that are of interest (i.e. those that occurred outside of the 7 hr buffer)
    ## Give the events a new ID so there are no gaps
    ## Join them with the rest of the datetime stamps
    Events <- Events %>% 
      mutate(ID = row_number()) %>% 
      dplyr::select(datetime, ID)
    
    
    ## Expand each event by 7 hrs
    Events <- Events %>%
      group_by(ID) %>%
      do(data.frame(ID= .$ID, datetime= seq(.$datetime, .$datetime + hours(7), by = '1 hour'), stringsAsFactors=FALSE)) %>% 
      as.data.frame()
    
    
    ## Join with initial data by datettime
    DatJoin <- dat %>% 
      left_join(Events, by = "datetime")
    
    
    DatJoin