Search code examples
rdatetimeduplicatesrowrbind

Duplicating and modifying rows based on datetime


I have got have a data.table that looks like this

library(dplyr)
library(data.table)

dt <- data.table(ID=c("A001","A002","A003","A004"),start_time=c('2019-06-18 05:18:00','2020-03-04 05:00:00',
                 '2019-05-10 19:00:00','2020-01-06 22:42:00'),end_time=c('2019-06-18 08:41:00','2020-03-04 05:04:00',
                 '2019-05-10 19:08:00','2020-01-07 03:10:00'))
 ID      

    start_time            end_time duration
1: A001 2019-06-18 05:18:00 2019-06-18 08:41:00 203 mins
2: A002 2020-03-04 05:59:00 2020-03-04 06:04:00   5 mins
3: A003 2019-05-10 19:00:00 2019-05-10 19:08:00   8 mins
4: A004 2020-01-06 22:42:00 2020-01-07 03:10:00 268 mins

Duration was simply calculated as

dt$start_time <- as.POSIXct(dt$start_time, tz='UTC')  
dt$end_time <- as.POSIXct(dt$end_time, tz='UTC')
dt <- dt %>% mutate(duration = (end_time-start_time))

I need to duplicate rows where duration is larger than the end of the hour from start_time (records that cover > 1 hour). I need to change for them start time (beginning of the hour), end time - end of hour OR the original end time if if's the last row (last viewing hour),and duration accordingly, so that the final output would look like:

    dt_expected <- data.table(ID=c("A001","A001","A001","A001","A002","A002","A003","A004","A004","A004","A004","A004","A004"),
start_time=c('2019-06-18 05:18:00','2019-06-18 06:00:00','2019-06-18 07:00:00','2019-06-18 08:00:00', '2020-03-04 05:59:00', '2020-03-04 06:00:00',  '2019-05-10 19:00:00',
'2020-01-06 22:42:00', '2020-01-06 23:00:00','2020-01-07 00:00:00','2020-01-07 01:00:00','2020-01-07 02:00:00','2020-01-07 03:00:00'),
end_time=c('2019-06-18 05:59:00','2019-06-18 06:59:00','2019-06-18 07:59:00','2019-06-18 08:41:00','2020-03-04 05:59:00','2020-03-04 06:04:00',   '2019-05-10 19:08:00',    '2020-01-06 22:59:00','2020-01-06 23:59:00','2020-01-07 00:59:00','2020-01-07 01:59:00', '2020-01-07 02:59:00','2020-01-07 03:10:00'), 
duration = c(12,60,60,41,1,4,8,18,60,60,60,60,10)) 

Note that records for ID A002 should also be duplicated as duration happened in 2 different hours.

      ID          start_time            end_time duration
 1: A001 2019-06-18 05:18:00 2019-06-18 05:59:00       12
 2: A001 2019-06-18 06:00:00 2019-06-18 06:59:00       60
 3: A001 2019-06-18 07:00:00 2019-06-18 07:59:00       60
 4: A001 2019-06-18 08:00:00 2019-06-18 08:41:00       41
 5: A002 2020-03-04 05:59:00 2020-03-04 05:59:00        1
 6: A002 2020-03-04 06:00:00 2020-03-04 06:04:00        4
 7: A003 2019-05-10 19:00:00 2019-05-10 19:08:00        8
 8: A004 2020-01-06 22:42:00 2020-01-06 22:59:00       18
 9: A004 2020-01-06 23:00:00 2020-01-06 23:59:00       60
10: A004 2020-01-07 00:00:00 2020-01-07 00:59:00       60
11: A004 2020-01-07 01:00:00 2020-01-07 01:59:00       60
12: A004 2020-01-07 02:00:00 2020-01-07 02:59:00       60
13: A004 2020-01-07 03:00:00 2020-01-07 03:10:00       10

Solution

  • I think this is pretty close to what you're looking for.

    This creates new rows of start and end times, one row for each hour using map from purrr.

    Then, for each ID, it will determine start_time and end_time using pmin.

    First, for the end_time, it takes the minimum value between that row's end_time and an hour later than the start_time for that row. For example, the first row for A001 would have end_time of 6:00, which is the ceiling_date time for 5:18 to the nearest hour, and less than 6:18 from the sequence generated from map. For the last row for A001, the end_time is 8:41, which is less than the ceiling_date time of 9:00.

    The start_time will take the minimum value between the last row's end_time and that row's start_time. For example, the second row of A001 will have 6:00, which is the row above's end_time which is less than 6:18 from the sequence generated from map.

    Note that one row has 0 minutes for duration - the time fell right on the hour (19:00:00). These could be filtered out.

    library(purrr)
    library(dplyr)
    library(tidyr)
    library(lubridate)
    
    dt %>%
      rowwise() %>%
      mutate(start_time = map(start_time, ~seq.POSIXt(., ceiling_date(end_time, "hour"), by = "hour"))) %>%
      unnest(start_time) %>%
      group_by(ID) %>%
      mutate(end_time = pmin(ceiling_date(start_time, unit = "hour"), end_time),
             start_time = pmin(floor_date(lag(end_time, default = first(end_time)), unit = "hour"), start_time),
             duration = difftime(end_time, start_time, units = "mins"))
    

    Output

       ID    start_time          end_time            duration
       <chr> <dttm>              <dttm>              <drtn>  
     1 A001  2019-06-18 05:18:00 2019-06-18 06:00:00 42 mins 
     2 A001  2019-06-18 06:00:00 2019-06-18 07:00:00 60 mins 
     3 A001  2019-06-18 07:00:00 2019-06-18 08:00:00 60 mins 
     4 A001  2019-06-18 08:00:00 2019-06-18 08:41:00 41 mins 
     5 A002  2020-03-04 05:59:00 2020-03-04 06:00:00  1 mins 
     6 A002  2020-03-04 06:00:00 2020-03-04 06:04:00  4 mins 
     7 A003  2019-05-10 19:00:00 2019-05-10 19:00:00  0 mins 
     8 A003  2019-05-10 19:00:00 2019-05-10 19:08:00  8 mins 
     9 A004  2020-01-06 22:42:00 2020-01-06 23:00:00 18 mins 
    10 A004  2020-01-06 23:00:00 2020-01-07 00:00:00 60 mins 
    11 A004  2020-01-07 00:00:00 2020-01-07 01:00:00 60 mins 
    12 A004  2020-01-07 01:00:00 2020-01-07 02:00:00 60 mins 
    13 A004  2020-01-07 02:00:00 2020-01-07 03:00:00 60 mins 
    14 A004  2020-01-07 03:00:00 2020-01-07 03:10:00 10 mins