Search code examples
rconsolidation

R: Consolidating duplicate observations?


I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations only appear once; others appear multiple times (upwards of 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID, and that all information in columns 2:150 are concatenated. I do not need any calculations run on these observations, just a quick munging.

I've tried:

df.new <- group_by(df,"ID")

and also:

library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]

and unfortunately neither have worked. Any help is appreciated!


Solution

  • Using basic R:

    df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
                    day = c("1","2","3","4","5","6","7","8"),
                    year = c(2016,2017,2017,2016,2017,2016,2017,2016),
                    stringsAsFactors = F)
    
    > df
      ID day year
    1  a   1 2016
    2  a   2 2017
    3  b   3 2017
    4  b   4 2016
    5  b   5 2017
    6  c   6 2016
    7  d   7 2017
    8  d   8 2016
    

    Do:

    z = aggregate(df[,2:3], 
                  by = list(id = df$ID), 
                  function(x){ paste0(x, collapse = "/") }
                  )
    

    Result:

    > z
      id   day           year
    1  a   1/2      2016/2017
    2  b 3/4/5 2017/2016/2017
    3  c     6           2016
    4  d   7/8      2017/2016
    

    EDIT

    If you want to avoid "collapsing" NA do:

    z = aggregate(df[,2:3], 
                  by = list(id = df$ID), 
    
    
            function(x){ paste0(x[!is.na(x)],collapse = "/") })
    

    For a data frame like:

    > df
      ID  day year
    1  a    1 2016
    2  a    2   NA
    3  b    3 2017
    4  b    4 2016
    5  b <NA> 2017
    6  c    6 2016
    7  d    7 2017
    8  d    8 2016
    

    The result is:

    > z
      id day           year
    1  a 1/2           2016
    2  b 3/4 2017/2016/2017
    3  c   6           2016
    4  d 7/8      2017/2016