r, aggregate, nested-loops, apply, summary

Nesting aggregate within apply to aggregate multiple columns by multiple variables in R


I have a data frame with a set of score columns and a set of grouping variables, something like:

s1 s2 s3 g1 g2 g3
 4  3  7  F  F  T
 6  2  2  T  T  T
 2  4  9  G  G  F
 1  3  1  T  F  G

I want to aggregate the scores. At the moment I'm doing:

aggregate(df[c("s1", "s2", "s3")], df["g1"], function(x) c(m = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x)))

I'd like a single line of code that aggregates the score variables by each of the grouping factors in turn. Note that I'm not trying to get a summary of s1-s3 by combinations of g1-g3 (as per answers here). I've looked at summaryBy in the doBy package, but that also seems to work on combinations of the factors rather than giving an overall summary for each factor, which isn't what I want (useful though!). I've been playing with variants on:

apply(df[c("g1", "g2", "g3")], 2, function(z) aggregate(df[c("s1", "s2", "s3")], z, function(x) c(m = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x))))

But with that I get the error: "'by' must be a list". I think I could work out how to do this with a loop, and I know that various versions of ddply or reshape can do the aggregation, but the most intuitive way (to me at least) seems to be an apply combined with aggregate. What am I missing?


Solution

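  • First give the anonymous function from the question a name, mean.sd.n. The Map statement at the end then applies aggregate to the score columns df[1:3] once per grouping variable. Each grouping column is passed as df[nm], a one-column data frame, which is a list and therefore a valid by argument: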

    mean.sd.n <- function(x) c(m = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x))
    
    Map(function(nm) aggregate(df[1:3], df[nm], mean.sd.n), names(df)[4:6])
    

    giving:

    $g1
      g1     s1.m    s1.sd     s1.n      s2.m     s2.sd      s2.n      s3.m     s3.sd      s3.n
    1  F 4.000000       NA 1.000000 3.0000000        NA 1.0000000 7.0000000        NA 1.0000000
    2  G 2.000000       NA 1.000000 4.0000000        NA 1.0000000 9.0000000        NA 1.0000000
    3  T 3.500000 3.535534 2.000000 2.5000000 0.7071068 2.0000000 1.5000000 0.7071068 2.0000000
    
    $g2
      g2    s1.m   s1.sd    s1.n s2.m s2.sd s2.n     s3.m    s3.sd     s3.n
    1  F 2.50000 2.12132 2.00000    3     0    2 4.000000 4.242641 2.000000
    2  G 2.00000      NA 1.00000    4    NA    1 9.000000       NA 1.000000
    3  T 6.00000      NA 1.00000    2    NA    1 2.000000       NA 1.000000
    
    $g3
      g3     s1.m    s1.sd     s1.n      s2.m     s2.sd      s2.n     s3.m    s3.sd     s3.n
    1  F 2.000000       NA 1.000000 4.0000000        NA 1.0000000 9.000000       NA 1.000000
    2  G 1.000000       NA 1.000000 3.0000000        NA 1.0000000 1.000000       NA 1.000000
    3  T 5.000000 1.414214 2.000000 2.5000000 0.7071068 2.0000000 4.500000 3.535534 2.000000
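
    One thing the printout hides: because mean.sd.n returns a vector of length three, aggregate stores each of s1, s2 and s3 as a three-column matrix inside the result, so names like s1.m are only how the matrix columns are printed. The sketch below rebuilds the example data from the question's table and flattens one result with do.call(data.frame, ...). The names res and flat are illustrative and not part of the original answer.

    ```r
    # Example data, rebuilt from the table in the question.
    df <- data.frame(
      s1 = c(4, 6, 2, 1),
      s2 = c(3, 2, 4, 3),
      s3 = c(7, 2, 9, 1),
      g1 = c("F", "T", "G", "T"),
      g2 = c("F", "T", "G", "F"),
      g3 = c("T", "T", "F", "G")
    )

    mean.sd.n <- function(x) c(m = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x))

    # Aggregate the three score columns by g1 only.
    res <- aggregate(df[1:3], df["g1"], mean.sd.n)

    # The result has four columns; s1, s2 and s3 are each a matrix
    # whose columns are named m, sd and n.
    names(res)
    dim(res$s1)

    # Passing the columns to data.frame() splits each matrix into
    # ordinary columns named s1.m, s1.sd, s1.n, and so on.
    flat <- do.call(data.frame, res)
    names(flat)
    ```

    Flattening matters mainly if you want to refer to a statistic as a regular column (for example flat$s1.m) or write the result to a file. To flatten every element of the Map result, the same call can be applied to each element with lapply.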
    

    Note: This could be shortened slightly by using fn$ from the gsubfn package. It allows us to specify the anonymous function in the line of code that starts with Map using formula notation as shown:

    library(gsubfn)
    fn$Map(nm ~ aggregate(df[1:3], df[nm], mean.sd.n), names(df)[4:6])
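
    As for what the apply attempt in the question was missing: apply first converts the data frame to a matrix and then hands each column to the function as a plain vector, whereas aggregate requires its by argument to be a list. A minimal sketch of this, assuming the example data from the question's table (the names failed and res are illustrative), captures the error and then shows one possible repair: iterate with lapply and wrap each grouping vector in list().

    ```r
    # Example data, rebuilt from the table in the question.
    df <- data.frame(
      s1 = c(4, 6, 2, 1),
      s2 = c(3, 2, 4, 3),
      s3 = c(7, 2, 9, 1),
      g1 = c("F", "T", "G", "T"),
      g2 = c("F", "T", "G", "F"),
      g3 = c("T", "T", "F", "G")
    )

    mean.sd.n <- function(x) c(m = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = length(x))

    # The original attempt: z is a plain character vector, not a list,
    # so aggregate() stops. tryCatch() keeps the error message as a string.
    failed <- tryCatch(
      apply(df[c("g1", "g2", "g3")], 2,
            function(z) aggregate(df[c("s1", "s2", "s3")], z, mean.sd.n)),
      error = function(e) conditionMessage(e)
    )
    failed

    # One possible repair: lapply() over the grouping columns and wrap each
    # vector in list() so that 'by' is a list. The result is a list of
    # data frames named g1, g2 and g3.
    res <- lapply(df[c("g1", "g2", "g3")],
                  function(z) aggregate(df[c("s1", "s2", "s3")], list(z), mean.sd.n))
    names(res)
    res$g1
    ```

    With list(z) the grouping column in each result is labelled Group.1 rather than g1, g2 or g3. The Map version above avoids that by passing df[nm], which carries the column name with it.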