Search code examples
rdataframecox-regressionwrangle

Wrangle data from long to wide format for cox regression in R


I am trying to wrangle some data for a cox regression...

#generate some data
set.seed(1)
ID <- sort(rep(1:10, times = 5))
conditions <- rep(c("asthma", "copd", "af", "cvd", "ckd"), times = 10)
day <- sample(1:100, 50)

#assign to dataframe
df <- data.frame(ID, conditions, day)

I have data in a long format like this:

ID conditions day
1 asthma 68
1 copd 39
1 af 1
1 cvd 34
1 ckd 87
2 asthma 43
2 copd 14
2 af 82
2 cvd 59
2 ckd 51

And I need it wrangled to this: As you can see, ID=1 develops AF on day 1, cvd on day 34 and copd on day 39…
So assuming that this is in order of date…
In rownum 1, the af column changes to 1…
In rownum 2, the af AND cvd changes to 1…
In rownum3, the af AND cvd AND copd changes to 1…
Then it would be the same kind of pattern for all the other IDs.

rownum ID day asthma copd af cvd
1 1 1 0 0 1 0
2 1 34 0 0 1 1
3 1 39 0 1 1 1
4 1 68 1 1 1 1
5 2 14 0 1 0 0
6 2 43 1 1 0 0
7 2 51 1 1 0 1

I've tried using a lag function, but it just doesn't work... the lag needs to work for multiple columns as you can see above.

dt[,temp:=ifelse(is.na(reglag(event_dt,1)), as.integer(0), reglag(event_dt,1)), by=ID]
dt[, sequence:=cumsum(temp)+1, by=ID]

func = function(x)
{
  which(c(1,lag(x,1)[-1]) %in% 1) %>%
    c(length(x)+1) %>% 
    diff
}

reglag = function(x,lag) {c(rep(NA,lag), x[lag:(length(x)-1)])}

dt[, cond.time:=func(event_dt) %>% lapply(seq) %>% unlist, by=ID]

Would be very grateful for any help you could give. I also have a massive table, so maybe a loop would cause me memory issues...

Many many thanks in advance ~R


Solution

  • Arrange by ID and day, and pivot_wider. You'll get 1 for the disease at day d, 0 elsewhere. Use cumsum to add 1 to the consecutive values for each column.

    library(dplyr)
    library(tidyr)
    df %>% 
      arrange(ID, day) %>% 
      mutate(value = 1) %>% 
      pivot_wider(names_from = conditions, values_fill = 0) %>% 
      group_by(ID) %>% 
      mutate(across(af:ckd, cumsum))
    

    output

         ID   day    af   cvd  copd asthma   ckd
       <int> <int> <dbl> <dbl> <dbl>  <dbl> <dbl>
     1     1     1     1     0     0      0     0
     2     1    34     1     1     0      0     0
     3     1    39     1     1     1      0     0
     4     1    68     1     1     1      1     0
     5     1    87     1     1     1      1     1
     6     2    14     0     0     1      0     0
     7     2    43     0     0     1      1     0
     8     2    51     0     0     1      1     1
     9     2    59     0     1     1      1     1
    10     2    82     1     1     1      1     1
    # … with 40 more rows