I have a dataset that looks like this:
id land datetime
pb1 0 2004-04-05 01:44:00
pb1 1 2004-04-05 02:00:00
pb1 1 2004-04-05 16:00:00
pb2 1 2004-04-05 18:01:00
pb2 1 2004-04-05 20:00:00
library(data.table)
DT = data.table(
id = c("pb1", "pb1", "pb1", "pb2", "pb2"),
land = c(0L, 1L, 1L, 1L, 1L),
datetime = sprintf("2004-04-05 %02d:%02d:00",
c(1, 2, 16, 18, 20),
c(44, 0, 0, 1, 0))
)
I would like to make a column that cumulatively adds time (in days) but ONLY if there is a '1' in the land
column. I also would like the count to reset when the id
changes.
I have tried a variety of methods using data.table
, rleid
, and even a nested for
loop with no success. I have gotten errors using code like this:
DT[, total :=land*diff(as.numeric(datetime)), .(id, rleid(land))]
I have tried variations of the solution here: Calculating cumulative time in R
I'm not sure the best way to calculate the time interval (no success with difftime
or lubridate
).
I want the end result to look like this:
id land datetime cumtime.land
pb1 0 2004-04-05 01:44:00 0
pb1 1 2004-04-05 02:00:00 0
pb1 1 2004-04-06 16:00:00 1.58333
pb2 1 2004-04-05 18:00:00 0
pb2 1 2004-04-05 20:00:00 0.08333
I could not replicate @Japp's comment, but you can easily do this with dplyr
.
Depending on what your exact expected output is, you could stop before the summarize
call:
library(dplyr)
df=read.table(text=
"id land datetime
pb1 0 '2004-04-05 01:44:00'
pb1 1 '2004-04-05 02:00:00'
pb1 1 '2004-04-06 16:00:00'
pb1 1 '2004-04-07 16:00:00'
pb2 1 '2004-04-05 18:00:00'
pb2 1 '2004-04-05 20:00:00'", header=T) %>%
mutate(datetime=as.POSIXct(datetime,format='%Y-%m-%d %H:%M:%S'))
x = df %>%
group_by(id) %>%
arrange(id, datetime) %>%
mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0,
0,
difftime(datetime, lag(datetime), units="days"))) %>%
mutate(cumtime.land=time.land + ifelse(is.na(lag(time.land)), 0, lag(time.land)))
id land datetime time.land cumtime.land
<fct> <int> <dttm> <dbl> <dbl>
1 pb1 0 2004-04-05 01:44:00 0 0
2 pb1 1 2004-04-05 02:00:00 0 0
3 pb1 1 2004-04-06 16:00:00 1.58 1.58
4 pb1 1 2004-04-07 16:00:00 1 2.58
5 pb2 1 2004-04-05 18:00:00 0 0
6 pb2 1 2004-04-05 20:00:00 0.0833 0.0833
The key is to use the dplyr::lag()
function which takes the "line just above" in the table (which implies that you have to arrange()
it beforehand).
By wrapping this inside the ifelse
, I'm checking that land
and previous land
were not 0
(and that we are not in the first line of the id
, or lag(anything)
will be missing).
I then just reuse the lag()
function to get the cumtime.land
variable.