Search code examples
rsurvival-analysisrecode

Data transformation: I am looking for an efficient way in R to recode/expand many-to-one for survival analysis


I am looking at graft patency after surgery (CABG)

In a CABG procedure, a single patient will typically get more than one graft (bypass), and we are looking at time-to-failure. This is indicated in the raw data by a variable indicating number of failed grafts, and the time at which diagnosed.

My raw data is currently one-line-per-patient and I believe I need to make it one-line-per-graft in order to continue to KM and Cox analyses. I am considering assorted if/then loops, but wonder if there is a more-efficient way to recode here.

Example data:

Patient VeinGrafts   VeinsOccluded   Months
   1        2               0           36
   2        4               1           34
   3        3               2           38
   4        4               0           33

In order to look at this "per vein" I need to recode such that each #VeinGraft gets its own row, and VeinsOccluded becomes 1/0

I need each row replicated (VeinGrafts) times, such that patient 2 will have 4 rows, but one of them has the VeinsOccluded indicator and the other 3 do not

This is what I would need the above data to look like for my next analytic move.

Patient VeinGrafts   VeinsOccluded   Months
   1        2               0           36
   1        2               0           36
   2        4               1           34
   2        4               0           34
   2        4               0           34
   2        4               0           34
   3        3               1           38
   3        3               1           38
   3        3               0           38
   4        4               0           33
   4        4               0           33
   4        4               0           33
   4        4               0           33

This community has been so incredibly helpful to this point, but I have not been able to find a similar question answered - if I have overlooked I apologize, but most certainly appreciate any ideas you may have!


Solution

  • We can uncount to expand the data, then grouped by 'Patient', mutate the 'VeinsOccluded' by creating a logical expression with row_number() on the first value of 'VeinsOccluded', coerced to binary with +

    library(dplyr)
    library(tidyr)
    df1 %>%
        uncount(VeinGrafts, .remove = FALSE) %>%
        group_by(Patient) %>% 
        mutate(VeinsOccluded = +(row_number() <= first(VeinsOccluded))) %>%
        ungroup %>%
        select(names(df1))
    

    -output

    # A tibble: 13 x 4
    #   Patient VeinGrafts VeinsOccluded Months
    #     <int>      <int>         <int>  <int>
    # 1       1          2             0     36
    # 2       1          2             0     36
    # 3       2          4             1     34
    # 4       2          4             0     34
    # 5       2          4             0     34
    # 6       2          4             0     34
    # 7       3          3             1     38
    # 8       3          3             1     38
    # 9       3          3             0     38
    #10       4          4             0     33
    #11       4          4             0     33
    #12       4          4             0     33
    #13       4          4             0     33
    

    Or this can be done with data.table (probably in a more efficient way)

    library(data.table)
    setDT(df1)[rep(seq_len(.N), VeinGrafts)][,
       VeinsOccluded := +(seq_len(.N) <= first(VeinsOccluded)), Patient][]
    

    -output

    #      Patient VeinGrafts VeinsOccluded Months
    # 1:       1          2             0     36
    # 2:       1          2             0     36
    # 3:       2          4             1     34
    # 4:       2          4             0     34
    # 5:       2          4             0     34
    # 6:       2          4             0     34
    # 7:       3          3             1     38
    # 8:       3          3             1     38
    # 9:       3          3             0     38
    #10:       4          4             0     33
    #11:       4          4             0     33
    #12:       4          4             0     33
    #13:       4          4             0     33
    

    data

    df1 <- structure(list(Patient = 1:4, VeinGrafts = c(2L, 4L, 3L, 4L), 
        VeinsOccluded = c(0L, 1L, 2L, 0L), Months = c(36L, 34L, 38L, 
        33L)), class = "data.frame", row.names = c(NA, -4L))