Search code examples
rdatediffdata-preprocessing

Subtracting a date in one column from previous unique date in another column in R


I have a dataframe as following:

ID   DPREL          Dt_biop      
292  2012-06-11    2014-03-06    
292  2013-01-10    2014-03-06
292  2015-05-21    2014-03-06
292  2017-09-05    2014-03-06
292  2012-06-11    2015-05-21   
292  2012-09-07    2015-05-21
292  2012-09-07    2017-10-26
292  2013-01-10    2017-10-26
292  2015-05-21    2017-10-26
805  2013-09-09    2020-11-19
805  2020-03-01    2020-11-19

I need to add a new column "Diff" which is the difference between each DPREL and previous unique Biopsy date(Dt_biop), not exactly the previous row. In fact the new column should look like this:

ID   DPREL         Dt_biop       Diff
292  2012-06-11    2014-03-06    
292  2013-01-10    2014-03-06
292  2015-05-21    2014-03-06
292  2017-09-05    2014-03-06
292  2012-06-11    2015-05-21    -633
292  2012-09-07    2015-05-21    -545
292  2012-09-07    2017-10-26    -986
292  2013-01-10    2017-10-26    -861
292  2015-05-21    2017-10-26     0
805  2013-09-09    2020-11-19 
805  2020-03-01    2020-11-19

Note that the diff column should be created based on each ID. I already tried the following code.

DATA2 <- DATA2 %>%
 group_by(ID) %>%
 arrange(Dt_biop) %>%
 mutate(diff = DPREL - lag(Dt_biop, default = first(DPREL)))

Although it works, it calculates the difference between each DPREL and its previous Dt_biop row and this is not correct. It should calculate the difference between each DPPREL in each row with previous unique Dt_biop. For ID 292, we have 3 unique Dt_biop.

For the first biopsy in each ID, I do not need to calculate any diff becuase there is no previous biopsy. So, the diff value for these rows should be empty.


Solution

  • A tidyverse option using tidyr::fill.

    library(dplyr)
    library(tidyr)
    
    df %>%
      group_by(ID) %>%
      mutate(date = if_else(Dt_biop > lag(Dt_biop), lag(Dt_biop), NA_Date_)) %>%
      fill(date) %>%
      mutate(diff = DPREL-date) %>%
      select(-date)
    
    # # A tibble: 11 x 4
    # # Groups:   ID [2]
    #     ID   DPREL      Dt_biop    diff     
    #    <int> <date>     <date>     <drtn>   
    # 1    292 2012-06-11 2014-03-06   NA days
    # 2    292 2013-01-10 2014-03-06   NA days
    # 3    292 2015-05-21 2014-03-06   NA days
    # 4    292 2017-09-05 2014-03-06   NA days
    # 5    292 2012-06-11 2015-05-21 -633 days
    # 6    292 2012-09-07 2015-05-21 -545 days
    # 7    292 2012-09-07 2017-10-26 -986 days
    # 8    292 2013-01-10 2017-10-26 -861 days
    # 9    292 2015-05-21 2017-10-26    0 days
    # 10   805 2013-09-09 2020-11-19   NA days
    # 11   805 2020-03-01 2020-11-19   NA days
    

    Data

    df <- structure(list(ID = c(292L, 292L, 292L, 292L, 292L, 292L, 292L, 
    292L, 292L, 805L, 805L), DPREL = structure(c(15502, 15715, 16576, 
    17414, 15502, 15590, 15590, 15715, 16576, 15957, 18322), class = "Date"), 
        Dt_biop = structure(c(16135, 16135, 16135, 16135, 16576, 
        16576, 17465, 17465, 17465, 18585, 18585), class = "Date")), row.names = c(NA, 
    -11L), class = c("tbl_df", "tbl", "data.frame"))