Here is the simple problem I'm trying to solve: I have a data.table like following table, and I'm trying to use dcast.data.table
function to calculate number of advancement for each group, but also I'm interested to calculate median
of grades in each group:
set.seed(10);
DT = data.table(GROUP = sample(c("a","b","c"),100,replace = T),
ADVANCED = sample(c("ADVANCED","DROP"),100,replace = T),
GRADE = sample(1:10,100, replace=T))
GROUP ADVANCED GRADE
1: b ADVANCED 3
2: a ADVANCED 6
3: b ADVANCED 7
4: c ADVANCED 9
95: b DROP 6
96: c ADVANCED 5
97: a DROP 10
98: b ADVANCED 1
99: c DROP 6
100: a DROP 2
GROUP ADVANCED GRADE
Essentially here is the result I'm looking for:
result = merge(
dcast.data.table(DT,.Primitive("~")(GROUP,ADVANCED)),
dcast.data.table(DT,.Primitive("~")(GROUP,.),
value.var="GRADE",
fun.aggregate=median));
setnames(result,".","MEDIAN_GRADE")
GROUP ADVANCED DROP MEDIAN_GRADE
1: a 17 19 6
2: b 20 21 7
3: c 13 10 6
Now I'm wondering how can I do it without making two separate dcast tables and merge at the end. I'm dealing with many row and column in my tables and grouping by key is a bottleneck. I'm wondering is there a better way to calculate this?
** Since my first question was vague I edit completely (thanks to Frank and Akrun for their feedback).
For the updated question
setnames(dcast(DT, GROUP~ADVANCED, length)[dcast(DT, GROUP~., median),
on = "GROUP"], ".", "MEDIAN_GRADE")[]
# GROUP ADVANCED DROP MEDIAN_GRADE
#1: a 17 19 6
#2: b 20 21 7
#3: c 13 10 6
Or a faster approach would be to group by 'GROUP', get the median
of 'GRADE' and then do the join on
the dcast
output
DT[,.(MEDIAN_GRADE = median(GRADE)) , .(GROUP)][
dcast(DT, GROUP ~ ADVANCED, length), on = 'GROUP']