Search code examples
rdataframereshapereshape2dcast

Convert the event-level dataset to the patient-level data in r


I need to convert the event-level dataset to the patient-level data i.e. convert a long dataset to a wider one based on deidnum as a key variable. Additionally, I want to create columns for each produced event and its event time. In case the same patient has more than one event, consider the earliest event time.

Here is a similar data sample and my code:

df <- read.table(text = "deidnum,eventc,EVENTDT,MI_COMPLICATED
325107,MI,21,1
325107,New Rose Dyspnea Scale 2 or more,1468,NA
418351,New Rose Dyspnea Scale 2 or more,207,NA
839172,New Rose Dyspnea Scale 2 or more,1060,NA
839172,New Rose Dyspnea Scale 2 or more,1718,NA
1487422,MI,990,0
1487422,DEATH,1113,NA
1511165,MI,424,0
1511165,MI,608,1
1511165,New Rose Dyspnea Scale 2 or more,721,NA
", sep = ",", header = TRUE)

library(reshape2)
wide.df <- dcast(df, deidnum ~ eventc)
wide.df

The current output

  deidnum DEATH MI New Rose Dyspnea Scale 2 or more
1  325107     0  1                                1
2  418351     0  0                                1
3  839172     0  0                                2
4 1487422     1  1                                0
5 1511165     0  2                                1

The expected output: enter image description here

Any advice will be greatly appreciated.


Solution

  • A tidyverse workflow:

    library(tidyr)
    library(dplyr)
    
    df %>%
      slice_min(EVENTDT, by = c(deidnum, eventc)) %>%
      pivot_wider(id_cols = deidnum, names_from = eventc,
                  values_from = c(eventc, EVENTDT),
                  values_fn = list(eventc = length),
                  values_fill = list(eventc = 0),
                  unused_fn = first) %>%
      rename_with(~ sub("eventc_", "", .x), starts_with("eventc"))
    
    # # A tibble: 5 × 8
    #   deidnum    MI `New Rose Dyspnea Scale 2 or more` DEATH EVENTDT_MI `EVENTDT_New Rose Dyspnea Scale 2 or more` EVENTDT_DEATH MI_COMPLICATED
    #     <int> <int>                              <int> <int>      <int>                                      <int>         <int>          <int>
    # 1  325107     1                                  1     0         21                                       1468            NA              1
    # 2  418351     0                                  1     0         NA                                        207            NA             NA
    # 3  839172     0                                  1     0         NA                                       1060            NA             NA
    # 4 1487422     1                                  0     1        990                                         NA          1113              0
    # 5 1511165     1                                  1     0        424                                        721            NA              0
    

    Note: unused_fn = first is for grouping by the id_cols columns (deidnum) then summarizing the unused columns (MI_COMPLICATED) using first() (assuming it has been ordered by EVENTDT).