Search code examples
raggregateplyrsummarize

Aggregate data frame by column, filtering on a different column


I want to aggregate some columns of a data frame using a factor (group in the example) but I want to use only the rows with the highest values in a different column (time in the example)

df=data.frame(group=c(rep('a',5),rep('b',5)),
           time=c(1:5,2:6),
           V1=c(1,1,1,2,2,1,1,1,1,1),
           V2=c(2,2,1,1,1,1,1,1,1,5))

I know how to do it using ddply but it's pretty slow

ddply(df,'group',summarize,
      V1=sum(V1[order(time,decreasing = T)[1:2]]),
      V2=sum(V2[order(time,decreasing = T)[1:2]]))

"group" "V1"    "V2"
"a" 4   2
"b" 2   6

Is there a faster way to do it (aggregate or data.table)?


Solution

  • We can arrange the data by time, group_by time and sum top 2 values using tail.

    This can be done using dplyr :

    library(dplyr)
    
    df %>%
      arrange(group, time) %>%
      group_by(group) %>%
      summarise_at(vars(V1:V2), ~sum(tail(., 2)))
    
    #  group    V1    V2
    #  <fct> <dbl> <dbl>
    #1 a         4     2
    #2 b         2     6
    

    and in data.table as :

    library(data.table)
    setDT(df)[order(group, time), lapply(.SD, function(x) sum(tail(x, 2))), 
              .SDcols = c('V1', 'V2'), group]