Tags: r, dplyr, transform, tidyr

Transform data to long format in R given survival time


Consider the following sample dataset.

* id is an individual's identifier.

* Surv_time is the individual's survival time.

* start is the time at which zj, a time-varying covariate, is measured.

set.seed(1)
n <- 5
Surv_time <- round(runif(n, 12, 20))  # survival time
dat <- data.frame(id = 1:n, Surv_time)
ntp <- rep(3, n)  # three measurements per individual

# Stack (start, id) rows for each individual; the NA first row is dropped below
w <- matrix(ncol = 2, nrow = 1)
m <- 0
for (l in ntp) {
  m <- m + 1
  ft <- seq(from = runif(1, 0, 8), to = runif(1, 12, 20), length.out = l)
  w <- rbind(w, cbind(round(ft), m))
}

d <- data.frame(w[-1, ])
colnames(d) <- c("start", "id")
D <- merge(d, dat, by = "id")  # add Surv_time to each measurement
D$zj <- with(D, 0.3 * start)   # time-varying covariate
D
   id start Surv_time  zj
1   1     7        14 2.1
2   1    13        14 3.9
3   1    20        14 6.0
4   2     5        15 1.5
5   2    11        15 3.3
6   2    17        15 5.1
7   3     0        17 0.0
8   3     7        17 2.1
9   3    14        17 4.2
10  4     1        19 0.3
11  4     9        19 2.7
12  4    17        19 5.1
13  5     3        14 0.9
14  5    11        14 3.3
15  5    18        14 5.4

I need code that transforms the data to start-stop format, where the stop of each interval is the start of the next one and the last stop for an individual is Surv_time. Measurements taken after Surv_time (for example, start = 20 for id 1) are dropped. I should end up with

   id start stop Surv_time  zj
1   1     7   13        14 2.1
2   1    13   14        14 3.9

4   2     5   11        15 1.5
5   2    11   15        15 3.3

7   3     0    7        17 0.0
8   3     7   14        17 2.1
9   3    14   17        17 4.2

10  4     1    9        19 0.3
11  4     9   17        19 2.7
12  4    17   19        19 5.1

13  5     3   11        14 0.9
14  5    11   14        14 3.3
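
To make the rule concrete, here is a minimal base R sketch for id 1 alone, using its values from the data above (start = 7, 13, 20 and Surv_time = 14). The names stop_time, keep and one_id are illustrative only and do not appear in the original data.

```r
# Worked example for id 1: measurement times and survival time from the data above
start     <- c(7, 13, 20)
Surv_time <- 14

# Each interval stops where the next one starts; the last one has no successor
stop_time <- c(start[-1], Inf)

# No interval may run past the survival time
stop_time <- pmin(stop_time, Surv_time)

# Keep only intervals that begin before they end (drops the start = 20 row)
keep   <- start < stop_time
one_id <- data.frame(start = start[keep], stop = stop_time[keep])
one_id
```

This leaves the two intervals (7, 13] and (13, 14], matching the first two rows of the desired output.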

Solution

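  • We can use dplyr: within each id, take the next start as stop, cap it at Surv_time, and keep only the rows where start is before stop: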

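    library(dplyr)

    D %>%
      group_by(id) %>%
      # Next measurement time within each id; Inf marks the last measurement
      mutate(stop = lead(start, default = Inf),
             # Cap every interval at the individual's survival time
             stop = pmin(stop, Surv_time), .after = start) %>%
      # Drop rows whose interval would be empty
      filter(start < stop) %>%
      ungroup()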
    
    # A tibble: 12 × 5
          id start  stop Surv_time    zj
       <dbl> <dbl> <dbl>     <dbl> <dbl>
     1     1     7    13        14   2.1
     2     1    13    14        14   3.9
     3     2     5    11        15   1.5
     4     2    11    15        15   3.3
     5     3     0     7        17   0  
     6     3     7    14        17   2.1
     7     3    14    17        17   4.2
     8     4     1     9        19   0.3
     9     4     9    17        19   2.7
    10     4    17    19        19   5.1
    11     5     3    11        14   0.9
    12     5    11    14        14   3.3
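
If dplyr is not available, the same idea can be sketched in base R. This is only one possible approach, not part of the original answer. It rebuilds D from the values printed in the question so that it runs on its own, and it assumes the rows can be sorted by start within each id. The names nxt and res are illustrative.

```r
# Rebuild the example data from the values printed in the question
D <- data.frame(
  id        = rep(1:5, each = 3),
  start     = c(7, 13, 20, 5, 11, 17, 0, 7, 14, 1, 9, 17, 3, 11, 18),
  Surv_time = rep(c(14, 15, 17, 19, 14), each = 3)
)
D$zj <- 0.3 * D$start

# The next-start logic relies on rows being ordered by start within each id
D <- D[order(D$id, D$start), ]

# Next start within each id; Inf marks the last measurement of an individual
nxt <- ave(D$start, D$id, FUN = function(s) c(s[-1], Inf))

# Cap each interval at the survival time
D$stop <- pmin(nxt, D$Surv_time)

# Keep non-empty intervals and put the columns in the requested order
res <- D[D$start < D$stop, c("id", "start", "stop", "Surv_time", "zj")]
rownames(res) <- NULL
res
```

On this data, res has the same 12 rows as the tibble above, returned as a plain data frame.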