Search code examples
raggregate

Aggregating rows with shared values and simultaneously choosing value in a separate column to keep in aggregated row


hope you are well. I would like to a) add together values in Column A if the values in Column B are equal to one another AND the values in Column C are equal to one another, and simultaneously b) in the new summed row, only keep the value from Column D matching the maximal value of Column A from that summed group.

I think it is difficult to explain my query without an example.

Let us assume these are the relevant data:

df <- data.frame (A = c(10, 1, 4, 3, 7), 
                  B = c("a", "a", "b", "b", "b"), 
                  C = c(.5, .5, 2.5, 1.5, 2.5), 
                  D = c(54, 36, 94, 57, 49))

resulting in this dataframe:

   A B C   D
1 10 a 0.5 54
2  1 a 0.5 36
3  4 b 2.5 94
4  3 b 1.5 57
5  7 b 2.5 49

Notice that rows 1 and 2 are equivalent in B and C, so they should be summed. But row 1 has the greater value in A of the two, so 54 should be kept instead of 36. It's similar with rows 3 and 5. The end result should be:

A   B   C   D
11  a   .5  54
3   b   1.5 57
11  b   2.5 49

I am partway there. I have found some code that does part a). Either of these does the trick:

aggregate(A ~ B + C, df, sum)

library(data.table)
setDT(df)[, .(summedvar = sum(A)), by = .(A, B)]

However, these approaches delete Column D, unsurprisingly. I am curious if anyone has any ideas about how to incorporate part b). Maybe I need to do multiple steps? Or maybe I'm going about this the wrong way? I’m deeply grateful for any advice.


Solution

  • You can use which.max to get the index of highest value in A and get the corresponding value of D.

    Using dplyr you may do -

    library(dplyr)
    
    df %>%
      group_by(B, C) %>%
      summarise(D = D[which.max(A)], 
                A = sum(A), .groups = "drop") %>%
      select(A, B, C, D)
    
    #     A B         C     D
    #  <dbl> <chr> <dbl> <dbl>
    #1    11 a       0.5    54
    #2     3 b       1.5    57
    #3    11 b       2.5    49
    

    Similarly in data.table -

    library(data.table)
    
    setDT(df)
    df[, .(A = sum(A), D = D[which.max(A)]), .(B, C)]