Search code examples
rdplyrdata.tabledifftime

Cumulative time with reset


I have a dataset that looks like this:

id      land    datetime
pb1     0       2004-04-05 01:44:00
pb1     1       2004-04-05 02:00:00
pb1     1       2004-04-05 16:00:00 
pb2     1       2004-04-05 18:01:00 
pb2     1       2004-04-05 20:00:00   

library(data.table) 
DT = data.table(
  id = c("pb1", "pb1", "pb1", "pb2", "pb2"), 
  land = c(0L, 1L, 1L, 1L, 1L), 
  datetime = sprintf("2004-04-05 %02d:%02d:00", 
                     c(1, 2, 16, 18, 20), 
                     c(44, 0, 0, 1, 0))
)

I would like to make a column that cumulatively adds time (in days) but ONLY if there is a '1' in the land column. I also would like the count to reset when the id changes.

I have tried a variety of methods using data.table, rleid, and even a nested for loop with no success. I have gotten errors using code like this:

DT[, total :=land*diff(as.numeric(datetime)), .(id, rleid(land))]

I have tried variations of the solution here: Calculating cumulative time in R

I'm not sure the best way to calculate the time interval (no success with difftime or lubridate).

I want the end result to look like this:

id      land           datetime         cumtime.land
pb1     0       2004-04-05 01:44:00     0
pb1     1       2004-04-05 02:00:00     0
pb1     1       2004-04-06 16:00:00     1.58333
pb2     1       2004-04-05 18:00:00     0
pb2     1       2004-04-05 20:00:00     0.08333

Solution

  • I could not replicate @Japp's comment, but you can easily do this with dplyr.

    Depending on what your exact expected output is, you could stop before the summarize call:

    library(dplyr)
    df=read.table(text=
        "id      land    datetime
        pb1     0       '2004-04-05 01:44:00'
        pb1     1       '2004-04-05 02:00:00'
        pb1     1       '2004-04-06 16:00:00'
        pb1     1       '2004-04-07 16:00:00'
        pb2     1       '2004-04-05 18:00:00' 
        pb2     1       '2004-04-05 20:00:00'", header=T) %>% 
      mutate(datetime=as.POSIXct(datetime,format='%Y-%m-%d %H:%M:%S'))
    
    x = df %>% 
      group_by(id) %>% 
      arrange(id, datetime) %>% 
      mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0, 
                                 0,
                                 difftime(datetime, lag(datetime), units="days"))) %>% 
      mutate(cumtime.land=time.land + ifelse(is.na(lag(time.land)), 0, lag(time.land)))
    
      id     land datetime            time.land cumtime.land
      <fct> <int> <dttm>                  <dbl>        <dbl>
    1 pb1       0 2004-04-05 01:44:00    0            0     
    2 pb1       1 2004-04-05 02:00:00    0            0     
    3 pb1       1 2004-04-06 16:00:00    1.58         1.58  
    4 pb1       1 2004-04-07 16:00:00    1            2.58  
    5 pb2       1 2004-04-05 18:00:00    0            0     
    6 pb2       1 2004-04-05 20:00:00    0.0833       0.0833
    

    The key is to use the dplyr::lag() function which takes the "line just above" in the table (which implies that you have to arrange() it beforehand).

    By wrapping this inside the ifelse, I'm checking that land and previous land were not 0 (and that we are not in the first line of the id, or lag(anything) will be missing).

    I then just reuse the lag() function to get the cumtime.land variable.