
Summarize data by a subset of grouping variable


I would like to summarize one column by another, which I know how to do using either ddply or aggregate. But I also want to combine some levels of my grouping variable so that the data are summarized in a custom fashion, see below:

##Create Data
year<-c(2008,2008,2008,2008,2009,2009,2009,2010,2010,2010,2010,2011,2011,2011,2011,2011)
catch<-c(1,2,0,1,1,0,1,0,1,2,1,0,1,1,1,0)
df<-data.frame(year,catch)

##I know how to aggregate by year:
library(plyr)
yearlyAVcap<-ddply(df,~year,summarise,mean=mean(catch),sd=sd(catch))

##But I'd like to have 2008, 2009, 2010 summarized together and 2011 on its own (i.e. two groups: 1 = 2008, 2009, 2010; 2 = 2011)

Solution

  • You could try

    df1 <- transform(df, yrGroup=c('2008-2010', '2011')[(year==2011)+1])
    
    
    library(plyr)
    ddply(df1, .(yrGroup), summarise, mean=mean(catch), sd=sd(catch))
    #  yrGroup        mean        sd
    #1 2008-2010 0.9090909 0.7006490
    #2      2011 0.6000000 0.5477226
    

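    Or, using dplyr:

    library(dplyr)
    df1 %>%
        group_by(yrGroup) %>%
        summarise_each(funs(mean, sd), catch)
    # summarise_each() can be used for multiple columns; note that it and funs()
    # are deprecated in newer dplyr releases in favour of summarise() with across()
    #    yrGroup      mean        sd
    #1 2008-2010 0.9090909 0.7006490
    #2      2011 0.6000000 0.5477226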
    

    Explanation

    To build the two groups in the example (2008-2010 and 2011), we can use == to test for the year of the second group, i.e. 2011. The result is a logical vector. Adding 1 to it coerces it to a numeric index, so FALSE becomes 1 and TRUE becomes 2. (Adding 0 instead would give 0 for FALSE and 1 for TRUE, which cannot be used to index both labels.) This numeric index then picks the matching label from c('2008-2010', '2011').

    df$year==2011
    # [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
    #[13]  TRUE  TRUE  TRUE  TRUE

    (df$year==2011)+1
    # [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2

    c('2008-2010', '2011')[(df$year==2011)+1]
    # [1] "2008-2010" "2008-2010" "2008-2010" "2008-2010" "2008-2010" "2008-2010"
    # [7] "2008-2010" "2008-2010" "2008-2010" "2008-2010" "2008-2010" "2011"
    #[13] "2011"      "2011"      "2011"      "2011"
    

    If the second group also spans several years, for example 2008-2010 as one group and 2011-2013 as another, we can use %in% instead of ==:

      (df$year %in% 2011:2013)+1
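
Putting the %in% index together with the summary step, a minimal base R sketch might look like the following. It uses aggregate (mentioned in the question) instead of plyr or dplyr. The names late_years, res and summary_df are made up for illustration, and the 2011-2013 range is hypothetical, since the sample data only contains 2011 in that range.

```r
# Same data as in the question
year  <- c(2008,2008,2008,2008,2009,2009,2009,2010,2010,2010,2010,2011,2011,2011,2011,2011)
catch <- c(1,2,0,1,1,0,1,0,1,2,1,0,1,1,1,0)
df <- data.frame(year, catch)

# Hypothetical second group: any year in 2011-2013
late_years <- 2011:2013

# FALSE/TRUE + 1 gives index 1/2, which selects the group label
df$yrGroup <- c("2008-2010", "2011-2013")[(df$year %in% late_years) + 1]

# aggregate() stores the two statistics as a matrix column named "catch"
res <- aggregate(catch ~ yrGroup, data = df,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Flatten the matrix column into ordinary columns catch.mean and catch.sd
summary_df <- do.call(data.frame, res)
print(summary_df)
```

Because only 2011 falls into the second range here, the means and standard deviations should match the ddply output shown earlier; only the group label differs.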