Assume the following time series dataset:
DF <- data.frame(T0       = c(2012, 2016, 2014),
                 T1       = c(2017, NA, 2019),
                 Duration = c(5, 3, 5),
                 val12    = c(15, 43, 7),
                 val13    = c(16, 44, 8),
                 val14    = c(17, 45, 9),
                 val15    = c(18, 46, 10),
                 val16    = c(19, 47, 11),
                 val17    = c(20, 48, 12),
                 val18    = c(21, 49, 13),
                 val19    = c(22, 50, 14),
                 SumVal   = c(105, 194, 69))
print(DF)
    T0   T1 Duration val12 val13 val14 val15 val16 val17 val18 val19 SumVal
1 2012 2017        5    15    16    17    18    19    20    21    22    105
2 2016   NA        3    43    44    45    46    47    48    49    50    194
3 2014 2019        5     7     8     9    10    11    12    13    14     69
To build a duration model, I would like to aggregate the "valXX" variables into a single SumVal variable according to each observation's duration, as in the table above. The first SumVal (105) corresponds to val12+...+val17, since 2012-2017 (both endpoints included) is the time interval given for the first observation.
NAs in T1 indicate that the event of interest has not occurred yet, so the observation is censored. In this case, Duration and SumVal are based on the interval T0:2019.
I am struggling to implement a function in R that can perform this task on a very large data frame.
Any help would be much appreciated!
Here's a tidyverse approach.
library(tidyverse)
DF %>%
  # Track original rows and fill in censored (NA) T1 values from T0 + Duration
  mutate(row = row_number(),
         T1 = if_else(is.na(T1), T0 + Duration, T1)) %>%
  # Gather the valXX columns into long form
  gather(col, value, val12:val19) %>%
  # Convert column names into years, e.g. "val12" -> 2012
  mutate(year = as.numeric(str_remove(col, "val")) + 2000) %>%
  # Only keep the years within each row's T0:T1 interval
  filter(year >= T0 & year <= T1) %>%
  # Sum the values by row, equivalent to
  #   group_by(row) %>% summarize(SumVal2 = sum(value))
  count(row, wt = value, name = "SumVal2")
# A tibble: 3 x 2
    row SumVal2
  <int>   <dbl>
1     1     105
2     2     194
3     3      69
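Since the question mentions a very large data frame, here is a base R sketch of the same idea that avoids reshaping to long form. It builds a logical mask marking which valXX columns fall inside each row's interval, then takes row sums. It rests on a few assumptions: the value columns are named "val" plus a two-digit year in the 2000s, censored rows run to the last year present in the data (2019 here), and there are no NAs in the value columns. The names Duration2 and SumVal2 are only illustrative.

```r
# Same example data as in the question (Duration and SumVal are recomputed below)
DF <- data.frame(
  T0 = c(2012, 2016, 2014),
  T1 = c(2017, NA, 2019),
  val12 = c(15, 43, 7),  val13 = c(16, 44, 8),
  val14 = c(17, 45, 9),  val15 = c(18, 46, 10),
  val16 = c(19, 47, 11), val17 = c(20, 48, 12),
  val18 = c(21, 49, 13), val19 = c(22, 50, 14)
)

# Assumption: value columns are named "val" + two-digit year (20xx)
val_cols <- grep("^val[0-9]{2}$", names(DF), value = TRUE)
years    <- 2000 + as.numeric(sub("^val", "", val_cols))

# Assumption: censored rows (NA in T1) end at the last observed year
end_year <- ifelse(is.na(DF$T1), max(years), DF$T1)

# in_window[i, j] is TRUE when years[j] lies within [T0[i], end_year[i]]
in_window <- outer(DF$T0, years, "<=") & outer(end_year, years, ">=")

# Zero out values outside each row's interval, then sum across columns
DF$Duration2 <- end_year - DF$T0
DF$SumVal2   <- rowSums(as.matrix(DF[val_cols]) * in_window)
```

The mask is an n-by-k matrix (rows by year columns), so memory use grows with the size of the value block rather than with a long-form copy of the data. If the value columns can contain NAs, the multiplication would propagate them, and they would need separate handling.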