Search code examples
raggregatelarge-data

Use aggregate to facilitate calculations


So recently I've been learning the functions of aggregate and I find it really useful for large data sets I work with. Normally I manually manipulate data in excel which, while effective, is extremely time consuming.

So I was wondering if the following could be done. I have a data set that is similarly structured like so (but much bigger):

Fruit    Crate   Mass
Apple    A       4
Banana   A       3.4
Orange   B       2
Apple    C       2.1
Apple    C       4.5
Banana   C       5
Orange   D       1
Apple    D       1.3
Orange   D       2.4
Orange   D       3.2
Orange   E       2
Banana   E       1.1
Banana   E       0.7
Apple    E       2

Now I know with this I get Mass per crate:

TotalCrate<-aggregate(data$Mass,list(crate=data$Crate), sum)

And with this I get Mass per fruit per crate:

FruitperCrate<-aggregate(data$Mass, list(fruit=data$Fruit, crate=data$Crate), sum)

Now is there a way that I can get percent mass of fruit per crate, so basically, is there a way that I can divide the fruit mass by the total mass of for each respective crate? And if so, how can I do it for future reference.

Any help is appreciated.

Thank you


Solution

  • 1) First aggregate by Fruit and Crate and then use ave with prop.table to get the proportions of each fruit in its crate:

    ag <- aggregate(Mass ~ Fruit + Crate, data, sum)
    tr <- transform(ag, percent = 100 * ave(Mass, Crate, FUN = prop.table))
    

    giving:

    > tr
        Fruit Crate Mass   percent
    1   Apple     A  4.0  54.05405
    2  Banana     A  3.4  45.94595
    3  Orange     B  2.0 100.00000
    4   Apple     C  6.6  56.89655
    5  Banana     C  5.0  43.10345
    6   Apple     D  1.3  16.45570
    7  Orange     D  6.6  83.54430
    8   Apple     E  2.0  34.48276
    9  Banana     E  1.8  31.03448
    10 Orange     E  2.0  34.48276
    

    or graphically:

    library(ggplot2)
    ggplot(tr, aes(Crate, percent, fill = Fruit)) + 
       geom_bar(stat = "identity") + 
       scale_fill_manual(values = c("red", "yellow", "orange"))
    

    screenshot

    1a) This could also be expressed in a magrittr pipeline like this:

    library(magrittr)
    data %>%
         do(aggregate(Mass ~ Fruit + Crate, ., sum)) %>%
         transform(percent = 100 * ave(Mass, Crate, FUN = prop.table))
    

    2) and here is an alternative using dplyr that follows similar logic:

    library(dplyr)
    data %>% 
       group_by(Crate, Fruit) %>%
       summarize(Mass = sum(Mass)) %>%
       ungroup() %>%
       group_by(Crate) %>%
       mutate(percent = 100 * prop.table(Mass)) %>%
       ungroup()
    

    giving:

    # A tibble: 10 x 4
        Crate  Fruit  Mass   percent
       <fctr> <fctr> <dbl>     <dbl>
    1       A  Apple   4.0  54.05405
    2       A Banana   3.4  45.94595
    3       B Orange   2.0 100.00000
    4       C  Apple   6.6  56.89655
    5       C Banana   5.0  43.10345
    6       D  Apple   1.3  16.45570
    7       D Orange   6.6  83.54430
    8       E  Apple   2.0  34.48276
    9       E Banana   1.8  31.03448
    10      E Orange   2.0  34.48276
    

    3) A 2d layout could be had using xtabs:

    xt <- 100 * prop.table(xtabs(Mass ~ Crate + Fruit, data), 1)
    

    giving:

    > xt
         Fruit
    Crate     Apple    Banana    Orange
        A  54.05405  45.94595   0.00000
        B   0.00000   0.00000 100.00000
        C  56.89655  43.10345   0.00000
        D  16.45570   0.00000  83.54430
        E  34.48276  31.03448  34.48276
    

    which can be readily be shown in a graph like this:

    plot(xt, col = c("red", "yellow", "orange"), 
         main = "Proportion of Mass of Fruit per Crates")
    

    giving:

    screenshot

    The 2d layout can be reformed into long form with ftable:

    ftable(xt, row.vars = 1:2)
    

    giving:

    Crate Fruit            
    A     Apple    54.05405
          Banana   45.94595
          Orange    0.00000
    B     Apple     0.00000
          Banana    0.00000
          Orange  100.00000
    C     Apple    56.89655
          Banana   43.10345
          Orange    0.00000
    D     Apple    16.45570
          Banana    0.00000
          Orange   83.54430
    E     Apple    34.48276
          Banana   31.03448
          Orange   34.48276
    

    Note 1: The two lines of code in the question could be written using formula notation like this:

    aggregate(Mass ~ Crate, data, sum)
    
    aggregate(Mass ~ Fruit + Crate, data, sum)
    

    Note 2: The input used in reproducible form is:

    Lines <- "Fruit    Crate   Mass
    Apple    A       4
    Banana   A       3.4
    Orange   B       2
    Apple    C       2.1
    Apple    C       4.5
    Banana   C       5
    Orange   D       1
    Apple    D       1.3
    Orange   D       2.4
    Orange   D       3.2
    Orange   E       2
    Banana   E       1.1
    Banana   E       0.7
    Apple    E       2"
    data <- read.table(text = Lines, header = TRUE)