Search code examples
rdplyrtimestampdata.tabledifftime

Difference between timestamps per day, with a night break in R


I am trying to calculate the difference between two timestamps and aggregate this per date and uid. I have calculated the difference for each time interval, but would like to split these intervals per day (so create some kind of midnight break). Given that the differences between time intervals go overnight, I am having trouble calculating the time between each interval per day, and thus these breaks.

Here's a snippet of my data:

df <- structure(list(
start_timestamp = c("2013-03-27 01:21:23", "2013-03-28 07:11:58", "2013-03-28 09:09:56", "2013-03-29 00:19:32", "2013-03-29 02:22:53"), 
uid = c(0, 0, 0, 0, 0),
prev_start_timestamp = c("2013-03-27 01:13:26", "2013-03-27 05:58:53", "2013-03-28 08:41:41", "2013-03-28 10:47:01", "2013-03-29 02:17:26")), row.names = c("1", "2", "3", "4", 
"5"), class = "data.frame")

Usually I would work from this output and use dplyr or data.table to aggregate per day. but now, these time differences are simply calculated by substracting the time stamps. While, I want to have this differences split per day.

output, undesired

Expected output would be someting like this: But this one does not make any overnight breaks to separate the time each day.. this output indicates that some days have time intervals of longer than 24 hours, which is not possible of course..

enter image description here


Solution

  • Here is an option using data.table::foverlaps:

    #create a data.table of daily intervals
    datetimes <- DT[, seq(trunc(min(start), "days"), trunc(max(end)+24*60*60), "days")]
    days <- data.table(start=datetimes[-length(datetimes)], end=datetimes[-1L], key=cols)
    
    #set keys on original dataset and perform overlaps before calculating usage per day
    setkeyv(DT, cols)
    foverlaps(DT, days)[,
        .(phone_usage=sum(pmin(i.end, end) - pmax(i.start, start))), 
        .(uid, date=start)]
    

    output (will leave to OP to check manually if this is correct for the sample dataset):

       uid       date phone_usage
    1:   0 2013-03-27  65344 secs
    2:   0 2013-03-28  75192 secs
    3:   0 2013-03-29   1499 secs
    

    data (note that i have shortened OP's column names):

    library(data.table)
    DT <- data.table(
        end = c("2013-03-27 01:21:23", "2013-03-28 07:11:58", "2013-03-28 09:09:56", "2013-03-29 00:19:32", "2013-03-29 02:22:53"), 
        uid = c(0, 0, 0, 0, 0),
        start = c("2013-03-27 01:13:26", "2013-03-27 05:58:53", "2013-03-28 08:41:41", "2013-03-28 10:47:01", "2013-03-29 02:17:26"))
    cols <- c('start', 'end')
    DT[, (cols) := lapply(.SD, as.POSIXct, format="%Y-%m-%d %T"), .SDcols=cols]