Generating Data for Survival Analysis in R


I have a dataframe that records which drug each individual took in each year:

df_og <- data.frame(
  id=c(1,1,1,2,2,2,3,3,3,3),
  year=c(2001,2002,2003,2001,2002,2003,2000,2001,2002,2003),
  med1=c(1,1,1,1,1,0,0,0,0,1),
  med2=c(0,0,0,0,0,1,0,0,1,0),
  med3=c(0,0,0,0,0,0,1,1,0,0)
)

that looks like this:

id  year   med1 med2 med3 
1   2001    1    0    0
1   2002    1    0    0
1   2003    1    0    0
2   2001    1    0    0
2   2002    1    0    0
2   2003    0    1    0
3   2000    0    0    1
3   2001    0    0    1
3   2002    0    1    0
3   2003    1    0    0

The id column holds the subject id, year is the year of observation, and med1, med2 and med3 are dummy variables equal to 1 if the drug was taken that year and 0 if not.
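The dummy coding described above can be checked and collapsed into a single drug code. This is only a minimal sketch in base R: it assumes every row flags exactly one drug, and `med_cols` and the `med_group` column added to `df_og` are hypothetical helper names, not part of the original data.

```r
# Rebuild the example data so the snippet runs on its own
df_og <- data.frame(
  id   = c(1,1,1,2,2,2,3,3,3,3),
  year = c(2001,2002,2003,2001,2002,2003,2000,2001,2002,2003),
  med1 = c(1,1,1,1,1,0,0,0,0,1),
  med2 = c(0,0,0,0,0,1,0,0,1,0),
  med3 = c(0,0,0,0,0,0,1,1,0,0)
)

med_cols <- c("med1", "med2", "med3")  # hypothetical helper name

# Assumption: each id-year row flags exactly one drug
stopifnot(all(rowSums(df_og[med_cols]) == 1))

# Column index of the flagged drug in each row (1 = med1, 2 = med2, 3 = med3)
df_og$med_group <- max.col(as.matrix(df_og[med_cols]), ties.method = "first")
df_og$med_group
```

If the `stopifnot()` check fails, some row has no drug or several drugs flagged, and the single-code representation would no longer be appropriate.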

I'm trying to create a new dataframe:

df <- data.frame(
  id = c(1,2,2,3,3,3),
  time = c(3,2,1,2,1,1),
  failure = c(0,1,0,1,1,0),
  med_group = c(1,1,2,3,2,1))

which looks like:

  id  time failure med_group
   1   3      0        1
   2   2      1        1
   2   1      0        2
   3   2      1        3
   3   1      1        2
   3   1      0        1

where: id is the subject id, time counts the number of consecutive years a subject has been taking a given drug, failure is 1 if the subject switched to another drug after that period and 0 otherwise, and med_group is the drug the subject was taking.

Examples:

  1. first row of df: subject id=1 has taken med1 for 3 consecutive years and hasn't switched to another drug, so time=3, failure=0, med_group=1.
  2. second and third rows of df: id=2 took med1 for 2 consecutive years and then switched, so time=2, failure=1, med_group=1. The subject then took med2 for the last observed year, so time=1, failure=0, med_group=2.

and so on for the others. It's a tricky operation so I hope the question is clear enough.

Any suggestion will be welcomed! Cheers


Solution

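  • We can reshape the data to long format, drop the rows where value = 0, and set the last value within each id to 0 to mark the final drug as no failure. We then additionally group by name and, for each id and drug, count the rows (time) and check whether a failure occurred. This assumes the rows are sorted by year within each id.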

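    library(dplyr)
    
    df_og %>%
      # one row per id, year and drug
      tidyr::pivot_longer(cols = starts_with('med')) %>%
      # keep only the drug actually taken each year
      filter(value != 0) %>%
      group_by(id) %>%
      # the last observed row of each id is not a failure
      mutate(value = replace(value, n(), 0)) %>%
      # `.add` replaces the deprecated `add` argument (dplyr >= 1.0.0)
      group_by(name, .add = TRUE) %>%
      summarise(time = n(), 
                failure = +all(value == 1))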
    
    
    #     id name   time failure
    #  <dbl> <chr> <int>   <int>
    #1     1 med1      3       0
    #2     2 med1      2       1
    #3     2 med2      1       0
    #4     3 med1      1       0
    #5     3 med2      1       1
    #6     3 med3      2       1
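Note that the summary above counts rows per id and drug, so its rows come out in alphabetical drug order, and two separate spells on the same drug would be merged into one. As an alternative, here is a minimal base R sketch built on run-length encoding (`rle`). It assumes one row per id and year with no gaps and exactly one drug flagged per row; `med_cols`, `spells_for_id` and `result` are hypothetical names introduced for illustration.

```r
# Rebuild the example data so the snippet runs on its own
df_og <- data.frame(
  id   = c(1,1,1,2,2,2,3,3,3,3),
  year = c(2001,2002,2003,2001,2002,2003,2000,2001,2002,2003),
  med1 = c(1,1,1,1,1,0,0,0,0,1),
  med2 = c(0,0,0,0,0,1,0,0,1,0),
  med3 = c(0,0,0,0,0,0,1,1,0,0)
)

med_cols <- c("med1", "med2", "med3")

# Sort chronologically within id, then collapse the dummies into one drug code
df_og <- df_og[order(df_og$id, df_og$year), ]
df_og$med_group <- max.col(as.matrix(df_og[med_cols]), ties.method = "first")

# Hypothetical helper: one output row per consecutive run of the same drug
spells_for_id <- function(d) {
  runs <- rle(d$med_group)
  n <- length(runs$lengths)
  data.frame(
    id        = d$id[1],
    time      = runs$lengths,             # years spent in each run
    failure   = c(rep(1L, n - 1), 0L),    # every run except the last ends in a switch
    med_group = runs$values
  )
}

result <- do.call(rbind, lapply(split(df_og, df_og$id), spells_for_id))
rownames(result) <- NULL
result
```

On the example data this keeps the spells in chronological order, so id 3 appears as med3, then med2, then med1, as in the desired output.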