I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations only appear once; others appear multiple times (upwards of 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID, and that all information in columns 2:150 are concatenated. I do not need any calculations run on these observations, just a quick munging.
I've tried:
df.new <- group_by(df,"ID")
and also:
library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]
and unfortunately neither have worked. Any help is appreciated!
Using basic R:
df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
day = c("1","2","3","4","5","6","7","8"),
year = c(2016,2017,2017,2016,2017,2016,2017,2016),
stringsAsFactors = F)
> df
ID day year
1 a 1 2016
2 a 2 2017
3 b 3 2017
4 b 4 2016
5 b 5 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
Do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x, collapse = "/") }
)
Result:
> z
id day year
1 a 1/2 2016/2017
2 b 3/4/5 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016
EDIT
If you want to avoid "collapsing" NA do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x[!is.na(x)],collapse = "/") })
For a data frame like:
> df
ID day year
1 a 1 2016
2 a 2 NA
3 b 3 2017
4 b 4 2016
5 b <NA> 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
The result is:
> z
id day year
1 a 1/2 2016
2 b 3/4 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016