I want to aggregate some columns of a data frame using a factor (group
in the example) but I want to use only the rows with the highest values in a different column (time
in the example)
df=data.frame(group=c(rep('a',5),rep('b',5)),
time=c(1:5,2:6),
V1=c(1,1,1,2,2,1,1,1,1,1),
V2=c(2,2,1,1,1,1,1,1,1,5))
I know how to do it using ddply
but it's pretty slow
ddply(df,'group',summarize,
V1=sum(V1[order(time,decreasing = T)[1:2]]),
V2=sum(V2[order(time,decreasing = T)[1:2]]))
"group" "V1" "V2"
"a" 4 2
"b" 2 6
Is there a faster way to do it (aggregate
or data.table
)?
We can arrange
the data by time
, group_by
time
and sum
top 2 values using tail
.
This can be done using dplyr
:
library(dplyr)
df %>%
arrange(group, time) %>%
group_by(group) %>%
summarise_at(vars(V1:V2), ~sum(tail(., 2)))
# group V1 V2
# <fct> <dbl> <dbl>
#1 a 4 2
#2 b 2 6
and in data.table
as :
library(data.table)
setDT(df)[order(group, time), lapply(.SD, function(x) sum(tail(x, 2))),
.SDcols = c('V1', 'V2'), group]