
R: long to wide format with factor levels as binary variables and dates


I want to reshape a data frame from long to wide format and use the factor levels as binary variables: if a factor level occurs at least once for a patient, the corresponding variable should be 1, otherwise 0. In addition, I want the dates as variable values in columns date.1, date.2, ...

What I have is the following:

data_sample <- data.frame(
  PatID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  date   = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
  status = c("COPD", "CPOD", "NA", "NA", "Cardio", "CPOD", "Cardio", "Cardio", "Cerebro")
)

What I want is:

PatID  COPD  Cardio  Cerebro  date.COPD.1  date.COPD.2  date.Cardio.1  date.Cardio.2  date.Cerebro.1
1      1     0       0        2016-12-14   2017-02-04   NA             NA             NA
2      0     1       0        NA           NA           2012-03-27     NA             NA
3      1     1       1        2012-04-21   NA           2010-02-03     2011-03-05     2014-08-25

Solution

  • There are a few steps to take, but this should give you your desired output.

    Note, however, that there seems to be a typo in the input data: I assume you meant "COPD" instead of "CPOD", because that is what your expected output shows.

    The first step is to make the string "NA" an explicit missing value, i.e. NA.

    data_sample[data_sample == "NA"] <- NA
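
    To see what this replacement does, here is a minimal sketch on a toy data frame (the name toy and its values are made up for illustration, not part of the question's data):

    ```r
    # Toy data (hypothetical values): the string "NA" stands in for a missing entry
    toy <- data.frame(
      id     = c(1L, 2L, 3L),
      status = c("COPD", "NA", "Cardio"),
      stringsAsFactors = FALSE
    )

    # toy == "NA" is a logical matrix marking every cell that holds the string "NA";
    # assigning NA to those cells turns them into real missing values
    toy[toy == "NA"] <- NA

    is.na(toy$status)
    # [1] FALSE  TRUE FALSE
    ```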
    

    Now use data.table::dcast for the reshaping.

    library(data.table)  
    setDT(data_sample)
    
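    # number repeated statuses within each patient (1, 2, ...); used as date column suffix
    data_sample[, id := rowid(status), by = PatID]
    # binary indicators: 1 if the status occurs at least once for the patient, else 0
    dt1 <- dcast(data_sample[!is.na(date)], PatID ~ status,
                 value.var = "id", fun.aggregate = function(x) +any(x))
    # one date column per status and occurrence number
    dt2 <- dcast(data_sample[!is.na(date)], PatID ~ paste0("date_", status) + id, value.var = "date")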
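
    The two building blocks used here can be tried in isolation. The following is only a sketch with made-up input vectors; it shows how rowid() numbers repeated values and why +any(x) should yield 0 for status/patient combinations that do not occur (as far as I understand the dcast documentation, empty cells are filled by calling fun.aggregate on a zero-length vector when fill is not given):

    ```r
    library(data.table)

    # rowid() numbers repeated values in order of appearance
    ids <- rowid(c("COPD", "COPD", "Cardio", "COPD"))
    ids
    # [1] 1 2 1 3

    # any() of an empty vector is FALSE; unary + converts the logical to 0/1
    empty_cell  <- +any(integer(0))
    filled_cell <- +any(c(1L, 2L))
    empty_cell
    # [1] 0
    filled_cell
    # [1] 1
    ```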
    

    Finally, join both data.tables:

    out <- dt1[dt2, on = 'PatID']
    out
    #  PatID Cardio Cerebro COPD date_COPD_1 date_COPD_2 date_Cardio_1 date_Cardio_2 date_Cerebro_1
    #1:     1      0       0    1  2016-12-14  2017-02-04          <NA>          <NA>           <NA>
    #2:     2      1       0    0        <NA>        <NA>    2012-27-03          <NA>           <NA>
    #3:     3      1       1    1  2012-04-21        <NA>    2010-02-03    2011-03-05     2014-08-25
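
    As a cross-check of the 0/1 columns, the same indicators can be computed in base R with table(). This is an alternative sketch, not part of the data.table approach above; the names d and flags are introduced here for illustration, and d holds the corrected statuses with real NA values:

    ```r
    # Corrected sample data, with the "NA" strings already replaced by NA
    d <- data.frame(
      PatID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
      status = c("COPD", "COPD", NA, NA, "Cardio", "COPD", "Cardio", "Cardio", "Cerebro")
    )

    # table() counts each PatID/status pair (NA is dropped by default);
    # comparing with 0 and applying unary + gives a 0/1 matrix
    flags <- +(table(d$PatID, d$status) > 0)
    flags
    ```

    Rows are labelled by PatID and columns by status, so flags["2", "Cardio"] can be compared directly with the Cardio column for patient 2 in out.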
    

    Data (with the "CPOD" typo corrected)

    data_sample <- data.frame(
      PatID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
      date   = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
      status = c("COPD", "COPD", "NA", "NA", "Cardio", "COPD", "Cardio", "Cardio", "Cerebro")
    )