I am looking to determine the difference in days by groups across two columns and two rows. Essentially subtract from the End Day by the subsequent Start Day in the subsequent row and record the difference as new column in the data frame and start over when a new group (ID) has been identified.
Start_Date End_Date ID
2014-05-09 2015-05-08 01
2015-05-09 2016-05-08 01
2016-05-11 2017-05-10 01
2017-05-11 2018-05-10 01
2016-08-29 2017-08-28 02
2017-08-29 2018-08-28 02
The result should be something like table below.
Start_Date End_Date ID Days_Difference
2014-05-09 2015-05-08 01 NA
2015-05-09 2016-05-08 01 01
2016-05-11 2017-05-10 01 03
2017-05-11 2018-05-10 01 01
2016-08-29 2017-08-28 02 NA
2017-08-29 2018-08-28 02 01
Essentially I want to take the difference of the End Date and its left diagonal Start date across groups (ID). I am having a really hard time with this one. I don't think my code would be helpful. Any solution using tidyverse, data.table, or base R would be greatly appreciated!
We may get the difference between the lead
(next element) of 'Start_Date' and 'End_Date' after grouping
library(dplyr)
df1 <- df1 %>%
mutate(across(ends_with("Date"), as.Date)) %>%
group_by(ID) %>%
mutate(Days_Difference = as.numeric(lag(lead(Start_Date) - End_Date))) %>%
ungroup
-output
df1
# A tibble: 6 × 4
Start_Date End_Date ID Days_Difference
<date> <date> <int> <dbl>
1 2014-05-09 2015-05-08 1 NA
2 2015-05-09 2016-05-08 1 1
3 2016-05-11 2017-05-10 1 3
4 2017-05-11 2018-05-10 1 1
5 2016-08-29 2017-08-28 2 NA
6 2017-08-29 2018-08-28 2 1
Or a similar logic with data.table
library(data.table)
setDT(df1)[, Days_Difference :=
as.numeric(shift(shift(as.IDate(Start_Date), type = "lead") -
as.IDate(End_Date))), ID]
-output
> df1
Start_Date End_Date ID Days_Difference
<char> <char> <int> <num>
1: 2014-05-09 2015-05-08 1 NA
2: 2015-05-09 2016-05-08 1 1
3: 2016-05-11 2017-05-10 1 3
4: 2017-05-11 2018-05-10 1 1
5: 2016-08-29 2017-08-28 2 NA
6: 2017-08-29 2018-08-28 2 1
df1 <- structure(list(Start_Date = c("2014-05-09", "2015-05-09", "2016-05-11",
"2017-05-11", "2016-08-29", "2017-08-29"), End_Date = c("2015-05-08",
"2016-05-08", "2017-05-10", "2018-05-10", "2017-08-28", "2018-08-28"
), ID = c(1L, 1L, 1L, 1L, 2L, 2L)), class = "data.frame",
row.names = c(NA,
-6L))