Search code examples
rselectmultiple-columnsmultiple-conditionssubmatrix

Grouping on two columns and performing multiple calculations


I am currently working with a large dataframe which consists of rougly 20 columns and a lot of rows.

Simplified it can look like this:

letter = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "A", "A", "A", "B", "B", "B", "C")
number = c(1,2,3,1,2,1,2,3,2,1,2,3,2,2,3,2,1)\
value1 = c(1:17)
value2 = seq(18, 50, 2)
df = data.frame(letter, number, value1, value2)

I would like to do some calculations on specific sets of values. These are the unique combinations of letter and number. So all values who have the letter A and number 1 can be summed, counted (or count_not_zero), taken minimum, or other (more elaborate) calculations. My goal is to make a new data frame with these unique combinations and their calculations.

letter  number  value1.sum  value1.count  ..  value2.max  value2.elaborate
A       1       1           1             ..  18          0.56
A       2       26          3             ..  42          0.40
A       3       15          2             ..  40          0.44
B       1       ..          ..            ..  ..          ..
..      ..      ..          ..            ..  ..          ..
C       2       16          2             ..  34          0.44
C       3       8           1             ..  32          0.50

I tried doing this in several ways. For example making a list of matrices with the letter A (in this case), and then using aggregate while selecting on the individual letters. However this gets awfully big.

I tried several combinations of DPLYR package but it was difficult to do different kind of calculations, especially selfmade ones.


Solution

  • If you're looking into dplyr, what about using group_by and summarize_each ?

    library(dplyr)
    
    df %>% 
      group_by(letter, number) %>% 
      summarize_each(funs(mean, sum, max, min, n()), value1:value2)
    
     letter number value1_mean value2_mean value1_sum value2_sum value1_max value2_max value1_min value2_min value1_n value2_n
       (chr)  (dbl)       (dbl)       (dbl)      (int)      (dbl)      (int)      (dbl)      (int)      (dbl)    (int)    (int)
    1      A      1    1.000000    18.00000          1         18          1         18          1         18        1        1
    2      A      2    8.666667    33.33333         26        100         13         42          2         20        3        3
    3      A      3    7.500000    31.00000         15         62         12         40          3         22        2        2
    4      B      1    5.000000    26.00000         10         52          6         28          4         24        2        2
    5      B      2   11.666667    39.33333         35        118         16         48          5         26        3        3
    6      B      3   15.000000    46.00000         15         46         15         46         15         46        1        1
    7      C      1   13.500000    43.00000         27         86         17         50         10         36        2        2
    8      C      2    8.000000    32.00000         16         64          9         34          7         30        2        2
    9      C      3    8.000000    32.00000          8         32          8         32          8         32        1        1
    

    You can easily make your own helper functions to output the more elaborate calculations that you're looking for, and then call them inside funs.