Say I have a dataset like this:
df <- data.frame(id = c(1, 1, 1, 2, 2),
classname = c("Welding", "Welding", "Auto", "HVAC", "Plumbing"),
hours = c(3, 2, 4, 1, 2))
I.e.,
id classname hours
1 1 Welding 3
2 1 Welding 2
3 1 Auto 4
4 2 HVAC 1
5 2 Plumbing 2
I'm trying to figure out how to summarize the data in a way that gives me, for each id, a list of the classes they took as well as how many hours of each class. I would want these to be in a list so I can keep it one row per id. So, I would want it to return:
id class.list class.hours
1 1 Welding, Auto 5,4
2 2 HVAC, Plumbing 1,2
I was able to figure out how to get it to return the class.list.
library(dplyr)
classes <- df %>%
group_by(id) %>%
summarise(class.list = list(unique(as.character(classname))))
This gives me:
id class.list
1 1 Welding, Auto
2 2 HVAC, Plumbing
But I'm not sure how I could get it to sum the number of hours for each of those classes (class.hours).
Thanks for your help!
In base R, this can be accomplished with two calls to aggregate
. The inner call sums the hours and the outer call "concatenates" the hours and the class names. In the outer call of aggregate
, cbind
is used to include both the hours and the class names in the output, and also to provide the desired variable names.
# convert class name to character variable
df$classname <- as.character(df$classname)
# aggregate
aggregate(cbind("class.hours"=hours, "class.list"=classname)~id,
data=aggregate(hours~id+classname, data=df, FUN=sum), toString)
id class.hours class.list
1 1 4, 5 Auto, Welding
2 2 1, 2 HVAC, Plumbing
In data.table
, roughly the same output is produced with a chained statement.
setDT(df)[, .(hours=sum(hours)), by=.(id, classname)][, lapply(.SD, toString), by=id]
id classname hours
1: 1 Welding, Auto 5, 4
2: 2 HVAC, Plumbing 1, 2
The variable names could then be set using the data.table
setnames
function.