
Aggregating data in R with a user-defined function


I have grouped data in R using the aggregate method.

Avg=aggregate(x$a, by=list(x$b,x$c),FUN= mean)

This gives me the mean for all the values of 'a' grouped by 'b' and 'c' of data frame 'x'.

Now, instead of averaging all values of 'a', I want to average only the 5 largest values of 'a' within each group of 'b' and 'c'.

Sample data set

a    b    c
10   G    3 
20   G    3 
22   G    3
10   G    3 
15   G    3
25   G    3
30   G    3

The aggregate call above gives me:

Group.1    Group.2    x
  G          3       18.86
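
For reference, a minimal reproducible sketch of the sample data and the call above. The data frame is rebuilt by hand from the table, so the column types (numeric 'a' and 'c', character 'b') are an assumption:

```r
# Rebuild the sample data from the table above (column types are assumed)
x <- data.frame(
  a = c(10, 20, 22, 10, 15, 25, 30),
  b = "G",
  c = 3
)

# Mean of every value of 'a' within each (b, c) group
Avg <- aggregate(x$a, by = list(x$b, x$c), FUN = mean)
print(Avg)
# The single group's mean is 132 / 7, roughly 18.857
```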

But I want the average of just the 5 largest values of 'a':

Group.1    Group.2    x
  G          3       22.40

I can't work out how to fit the top-values selection below into the aggregate call:

index <- order(vector, decreasing = TRUE)[1:5]
vector[index]

Can anyone shed some light on whether this is possible?


Solution

  • Sort each group's values in decreasing order, take the first 5 with head(), and then apply mean():

    aggregate(x$a, by = list(x$b, x$c), FUN = function(z) mean(head(z[order(-z)], 5)))
    #  Group.1 Group.2    x
    #1       G       3 22.4
    

    If you want to do this with a custom function, I would do it like this:

    myfunc <- function(vec, n){
      mean(head(vec[order(-vec)], n))
    }
    
    aggregate(x$a, by=list(x$b,x$c),FUN= function(z) myfunc(z, 5))
    #  Group.1 Group.2    x
    #1       G       3 22.4
    

    I prefer the formula interface of aggregate, which looks like this (with() lets me refer to the columns directly instead of writing x$ each time):

    with(x, aggregate(a ~ b + c, FUN= function(z) myfunc(z, 5)))
    #  b c    a
    #1 G 3 22.4
    

    In this function, z receives the vector of a values for each combination of b and c in turn. Note that the result is a numeric (decimal) value, 22.4 in this case, not an integer.
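
To see how the function behaves across several groups, here is a small sketch under stated assumptions. The helper name top_n_mean, the data frame d, and the second group (H, 4) are made up for illustration. It also swaps vec[order(-vec)] for sort(vec, decreasing = TRUE), which gives the same ordering for numeric input but drops NA values by default.

```r
# Hypothetical helper: mean of the n largest values in a numeric vector.
# sort() replaces vec[order(-vec)] here; by default it removes NAs.
top_n_mean <- function(vec, n = 5) {
  mean(head(sort(vec, decreasing = TRUE), n))
}

# Made-up data: group (G, 3) is the sample from the question,
# group (H, 4) is invented and has only two rows.
d <- data.frame(
  a = c(10, 20, 22, 10, 15, 25, 30, 4, 8),
  b = c(rep("G", 7), rep("H", 2)),
  c = c(rep(3, 7), rep(4, 2))
)

# aggregate() calls the function once per (b, c) group,
# passing that group's 'a' values as z
res <- aggregate(a ~ b + c, data = d, FUN = function(z) top_n_mean(z, 5))
print(res)
```

When a group has fewer than n values, head() returns whatever is available, so the mean for (H, 4) is taken over its two values only. If that is not what you want, the helper would need an explicit check on length(vec).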