Search code examples
rdplyrspread

Spreading data that is grouped by ID but having different observations


I have this data:

drugData <- data.frame(caseID=c(9, 9, 10, 11, 12, 12, 12, 12, 13, 45, 45, 225),
            Drug=c("Cocaine", "Cocaine", "DPT", "LSD", "Cocaine", "LSD", "Heroin","Heroin", "LSD", "DPT", "DPT", "Heroin"),
             County=c("A", "A", "B", "C", "D", "D", "D","D", "E", "F", "F", "G"),
             Date=c(2009, 2009, 2009, 2009, 2011, 2011, 2011, 2011, 2010, 2010, 2010, 2005))

"CaseID" rows make up a single case, which may have observations of all the same drug, or different types of drugs. I want this data to look like the following:

CaseID  Drug.1     Drug.2    Drug. 3   Drug.4    County   Date
9       Cocaine    Cocaine   NA        NA        A        2009
10      DPT        LSD       NA        NA        B        2009
11      LSD        NA        NA        NA        C        2009
12      Cocaine    LSD       Heroin    Heroin    D        2011
13      LSD        NA        NA        NA        E        2010
45      DPT        DPT       NA        NA        F        2010
225     Heroin     NA        NA        NA        G        2005

I've tried using dplyr spread function but can't seem to quite get this to work.


Solution

  • We can pivot to wide format after creating a sequence column based on 'caseID'

    library(dplyr)
    library(tidyr)
    library(stringr)
    library(data.table)
    drugData %>%
       mutate(nm = str_c('Drug', rowid(caseID))) %>% 
       pivot_wider(names_from = nm, values_from = Drug) 
    #A tibble: 7 x 7
    #  caseID County  Date Drug1   Drug2   Drug3  Drug4 
    #   <dbl> <fct>  <dbl> <fct>   <fct>   <fct>  <fct> 
    #1      9 A       2009 Cocaine Cocaine <NA>   <NA>  
    #2     10 B       2009 DPT     <NA>    <NA>   <NA>  
    #3     11 C       2009 LSD     <NA>    <NA>   <NA>  
    #4     12 D       2011 Cocaine LSD     Heroin Heroin
    #5     13 E       2010 LSD     <NA>    <NA>   <NA>  
    #6     45 F       2010 DPT     DPT     <NA>   <NA>  
    #7    225 G       2005 Heroin  <NA>    <NA>   <NA>  
    

    Or with spread (spread is deprecated in place of pivot_wider

    drugData %>%
       mutate(nm = str_c('Drug', rowid(caseID))) %>% 
       spread(nm, Drug)
    

    Or using data.table

    dcast(setDT(drugData), caseID + County  + Date ~
             paste0('Drug', rowid(caseID)), value.var = 'Drug')
    #   caseID County Date   Drug1   Drug2  Drug3  Drug4
    #1:      9      A 2009 Cocaine Cocaine   <NA>   <NA>
    #2:     10      B 2009     DPT    <NA>   <NA>   <NA>
    #3:     11      C 2009     LSD    <NA>   <NA>   <NA>
    #4:     12      D 2011 Cocaine     LSD Heroin Heroin
    #5:     13      E 2010     LSD    <NA>   <NA>   <NA>
    #6:     45      F 2010     DPT     DPT   <NA>   <NA>
    #7:    225      G 2005  Heroin    <NA>   <NA>   <NA>