r time-series survival-analysis

Calculating the sum of different columns for every observation based on a time variable


Assume the following time series dataset:

DF <- data.frame(T0=c(2012, 2016, 2014),
             T1=c(2017, NA, 2019),
             Duration= c(5,3,5),
             val12 =c(15,43,7),
             val13 =c(16,44,8),
             val14 =c(17,45,9),
             val15 =c(18,46,10),
             val16 =c(19,47,11),
             val17 =c(20,48,12),
             val18 =c(21,49,13),
             val19 =c(22,50,14),
             SumVal =c(105,194,69))

print(DF)

    T0   T1 Duration val12 val13 val14 val15 val16 val17 val18 val19 SumVal
1 2012 2017        5    15    16    17    18    19    20    21    22    105
2 2016   NA        3    43    44    45    46    47    48    49    50    194
3 2014 2019        5     7     8     9    10    11    12    13    14     69

To build a duration model, I would like to aggregate the "valXX" variables into a single SumVal variable over each observation's time interval, as in the table above. The first SumVal (105) corresponds to val12 + ... + val17, since the interval for the first observation is 2012-2017.

NAs in T1 indicate that the event of interest has not occurred yet, so the observation is censored. In this case, Duration and SumVal are based on the interval T0:2019.
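The censoring rule can be spelled out in a few lines of base R. This is only a sketch: `DF_t`, `last_year` and `T_end` are names introduced here for illustration, and 2019 is assumed to be the last observed year, as in the example.

```r
# Toy copy of the timing columns from the example above
DF_t <- data.frame(T0 = c(2012, 2016, 2014),
                   T1 = c(2017, NA, 2019))

last_year <- 2019  # assumption: last year covered by the valXX columns

# Censored rows (T1 is NA) run until the last observed year
T_end <- ifelse(is.na(DF_t$T1), last_year, DF_t$T1)
Duration <- T_end - DF_t$T0

T_end     # 2017 2019 2019
Duration  # 5 3 5
```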

I am struggling to implement a function in R that can perform this task on a very large data frame.

Any help would be much appreciated!


Solution

  • Here's a tidyverse approach.

    library(tidyverse)
    DF %>%
      # Track orig rows, and fill in NA T1's
      mutate(row = row_number(), 
             T1 = if_else(is.na(T1), T0 + Duration, T1)) %>%
      # Reshape into long form (pivot_longer supersedes gather)
      pivot_longer(val12:val19, names_to = "col", values_to = "value") %>%
      # Convert column names into years, e.g. "val12" -> 2012
      mutate(year = as.numeric(str_remove(col, "val")) + 2000) %>%
      # Only keep the rows within each duration
      filter(year >= T0 & year <= T1) %>%
      # Count total value by row, equiv to 
      # group_by(row) %>% summarize(SumVal2 = sum(value))
      count(row, wt = value, name = "SumVal2")
    
    # A tibble: 3 x 2
        row SumVal2
      <int>   <dbl>
    1     1     105
    2     2     194
    3     3      69
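
For a very large data frame, reshaping to long form can use a lot of memory. One possible alternative, sketched here in base R, builds a logical mask marking which year columns fall inside each row's interval and sums the masked values row by row. The names `val_cols`, `years`, `T_end` and `in_window` are introduced here for illustration, and the sketch assumes that censored rows end at the last available year and that the valXX columns contain no missing values.

```r
# Example data from the question (without the precomputed SumVal)
DF <- data.frame(T0 = c(2012, 2016, 2014), T1 = c(2017, NA, 2019),
                 val12 = c(15, 43, 7),  val13 = c(16, 44, 8),
                 val14 = c(17, 45, 9),  val15 = c(18, 46, 10),
                 val16 = c(19, 47, 11), val17 = c(20, 48, 12),
                 val18 = c(21, 49, 13), val19 = c(22, 50, 14))

# Map column names to calendar years, e.g. "val12" -> 2012
val_cols <- grep("^val[0-9]+$", names(DF), value = TRUE)
years <- 2000 + as.numeric(sub("^val", "", val_cols))

# Assumption: censored rows (NA in T1) end at the last available year
T_end <- ifelse(is.na(DF$T1), max(years), DF$T1)

# in_window[i, j] is TRUE when T0[i] <= years[j] <= T_end[i]
in_window <- outer(DF$T0, years, "<=") & outer(T_end, years, ">=")

# Zero out values outside each row's interval, then sum across columns
DF$SumVal2 <- rowSums(as.matrix(DF[val_cols]) * in_window)

DF$SumVal2  # 105 194 69
```

The mask has one cell per row and year column, so memory use grows with the size of the value matrix rather than with a long-form copy of the whole data frame. If the valXX columns can contain NAs, they would need to be handled before the multiplication, since `NA * FALSE` is still `NA`.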