I have a dataframe as following:
ID DPREL Dt_biop
292 2012-06-11 2014-03-06
292 2013-01-10 2014-03-06
292 2015-05-21 2014-03-06
292 2017-09-05 2014-03-06
292 2012-06-11 2015-05-21
292 2012-09-07 2015-05-21
292 2012-09-07 2017-10-26
292 2013-01-10 2017-10-26
292 2015-05-21 2017-10-26
805 2013-09-09 2020-11-19
805 2020-03-01 2020-11-19
I need to add a new column "Diff" which is the difference between each DPREL and previous unique Biopsy date(Dt_biop), not exactly the previous row. In fact the new column should look like this:
ID DPREL Dt_biop Diff
292 2012-06-11 2014-03-06
292 2013-01-10 2014-03-06
292 2015-05-21 2014-03-06
292 2017-09-05 2014-03-06
292 2012-06-11 2015-05-21 -633
292 2012-09-07 2015-05-21 -545
292 2012-09-07 2017-10-26 -986
292 2013-01-10 2017-10-26 -861
292 2015-05-21 2017-10-26 0
805 2013-09-09 2020-11-19
805 2020-03-01 2020-11-19
Note that the diff column should be created based on each ID. I already tried the following code.
DATA2 <- DATA2 %>%
group_by(ID) %>%
arrange(Dt_biop) %>%
mutate(diff = DPREL - lag(Dt_biop, default = first(DPREL)))
Although it works, it calculates the difference between each DPREL and its previous Dt_biop row and this is not correct. It should calculate the difference between each DPPREL in each row with previous unique Dt_biop. For ID 292, we have 3 unique Dt_biop.
For the first biopsy in each ID, I do not need to calculate any diff becuase there is no previous biopsy. So, the diff value for these rows should be empty.
A tidyverse option using tidyr::fill
.
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
mutate(date = if_else(Dt_biop > lag(Dt_biop), lag(Dt_biop), NA_Date_)) %>%
fill(date) %>%
mutate(diff = DPREL-date) %>%
select(-date)
# # A tibble: 11 x 4
# # Groups: ID [2]
# ID DPREL Dt_biop diff
# <int> <date> <date> <drtn>
# 1 292 2012-06-11 2014-03-06 NA days
# 2 292 2013-01-10 2014-03-06 NA days
# 3 292 2015-05-21 2014-03-06 NA days
# 4 292 2017-09-05 2014-03-06 NA days
# 5 292 2012-06-11 2015-05-21 -633 days
# 6 292 2012-09-07 2015-05-21 -545 days
# 7 292 2012-09-07 2017-10-26 -986 days
# 8 292 2013-01-10 2017-10-26 -861 days
# 9 292 2015-05-21 2017-10-26 0 days
# 10 805 2013-09-09 2020-11-19 NA days
# 11 805 2020-03-01 2020-11-19 NA days
Data
df <- structure(list(ID = c(292L, 292L, 292L, 292L, 292L, 292L, 292L,
292L, 292L, 805L, 805L), DPREL = structure(c(15502, 15715, 16576,
17414, 15502, 15590, 15590, 15715, 16576, 15957, 18322), class = "Date"),
Dt_biop = structure(c(16135, 16135, 16135, 16135, 16576,
16576, 17465, 17465, 17465, 18585, 18585), class = "Date")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))