Search code examples
raggregatesummary

Nesting a SUM and a MEAN in an aggregate to get means of scores per group


I can't find a dataset similar to my problem so I changed the dataset Iris (dataset in R) to look similar - it's close enough!

data = iris
data$type = gl(5,30,150,labels=c("group1","group2","group3","group4","group5"))
data$ID = gl(30,5,150)

Then I used the following code

xtabs(Sepal.Length ~ Species + type, aggregate(Sepal.Length ~ Species + type + ID, data, mean))

which results in

type
Species      group1 group2 group3 group4 group5
  setosa      30.16  19.90   0.00   0.00   0.00
  versicolor   0.00  12.20  35.88  11.28   0.00
  virginica    0.00   0.00   0.00  26.24  39.64

My understanding is that what my code is doing is adding together Sepal.Length for each ID then taking the mean of those values by each of Species and type.

Is this correct?

If not, how would I get this?

Additionally, how would I get this if my data is such that each ID has multiple types? (can't figure out how to construct this in R)

Actually, just to be perfectly clear

What I want is a code that sums together the Sepal.Length for each ID AND type then it will take the average of those sums over all IDs and post an average Sepal.Length by type and species/


Solution

  • With data.table:

    library(data.table)
    setDT(data)
    
    #sum of Sepal.Length for each ID AND type
    data[, id_type_sum := sum(Sepal.Length), by = .(ID, type)]
    
    # mean of this variable by type and species
    data[, mean(id_type_sum), by = .(type, Species)]
    
    #   type    Species       V1
    # 1: group1     setosa 25.13333
    # 2: group2     setosa 24.87500
    # 3: group2 versicolor 30.50000
    # 4: group3 versicolor 29.90000
    # 5: group4 versicolor 28.20000
    # 6: group4  virginica 32.80000
    # 7: group5  virginica 33.03333
    

    And if you want this in table format, you can use data.table's dcast method:

    library(magrittr) # for the %>% operator
    data[, mean(id_type_sum), by = .(type, Species)] %>%
      dcast(Species ~ type)
    

    Result:

          Species   group1 group2 group3 group4   group5
    1:     setosa 25.13333 24.875     NA     NA       NA
    2: versicolor       NA 30.500   29.9   28.2       NA
    3:  virginica       NA     NA     NA   32.8 33.03333