Search code examples
rdplyrsurvival

Calculate conditional cumulative sum with lag


I have this data frame:

have <- data.frame(ID = c(1, 1, 1, 2, 3),
                 tstart = c(0, 100, 200, 0, 0),
                 tstop = c(100, 200, 500, 400, 300),
                 tdc = c(0,1,1,0,1))
> have
  ID tstart tstop tdc
1  1      0   100   0
2  1    100   200   1
3  1    200   500   1
4  2      0   400   0
5  3      0   300   1

I would like to add a column with the cumulative sum of time from the interval tstart to tstop, conditional on tdc=1, i.e.,

want <- data.frame(ID = c(1, 1, 1, 2, 3),
                 tstart = c(0, 100, 200, 0, 0),
                 tstop = c(100, 200, 500, 400, 300),
                 tdc = c(0,1,1,0,1),
                 tdc.cum = c(0,100,400,0,300))
> want
  ID tstart tstop tdc tdc.cum
1  1      0   100   0       0
2  1    100   200   1     100
3  1    200   500   1     400
4  2      0   400   0       0
5  3      0   300   1     300

It would also be helpful to see how to lag the cumulative time by 10 units, i.e., subtract 10 units from each ID's total cumulative sum of rows with tdc=1.

want2 <- data.frame(ID = c(1, 1, 1, 2, 3),
                 tstart = c(0, 100, 200, 0, 0),
                 tstop = c(100, 200, 500, 400, 300),
                 tdc = c(0,1,1,0,1),
                 tdc.cum = c(0,100,390,0,290))
> want2
  ID tstart tstop tdc tdc.cum
1  1      0   100   0       0
2  1    100   200   1     100
3  1    200   500   1     390
4  2      0   400   0       0
5  3      0   300   1     290

I have tried to set up this data frame using survival::tmerge() and cumtdc() but I have only been able to get the cumulative sum of tdc (1 or 0) instead of the time interval. Thank you.


Solution

  • Using dplyr, this will achieve what you need. First, it calculates the cumulative sum of start/stop differences by group (ID). Second, if the last value in tdc for each group != 0, subtract 10 from last value in "tdc.cum":

    library(dplyr)
    
    df <- data.frame(ID = c(1, 1, 1, 2, 3),
                     tstart = c(0, 100, 200, 0, 0),
                     tstop = c(100, 200, 500, 400, 300),
                     tdc = c(0,1,1,0,1))
    
    df %>%
      group_by(ID) %>%
      mutate(tdc.cum = cumsum(ifelse(tdc == 1, tstop - tstart, 0)),
             tdc.cum = ifelse(tdc.cum == max(tdc.cum) & tdc != 0, 
                              tdc.cum - 10, tdc.cum)) %>%
      ungroup()
             
    # A tibble: 5 × 5
         ID tstart tstop   tdc tdc.cum
      <dbl>  <dbl> <dbl> <dbl>   <dbl>
    1     1      0   100     0       0
    2     1    100   200     1     100
    3     1    200   500     1     390
    4     2      0   400     0       0
    5     3      0   300     1     290