hope you are well. I would like to a) add together values in Column A if the values in Column B are equal to one another AND the values in Column C are equal to one another, and simultaneously b) in the new summed row, only keep the value from Column D matching the maximal value of Column A from that summed group.
I think it is difficult to explain my query without an example.
Let us assume these are the relevant data:
df <- data.frame (A = c(10, 1, 4, 3, 7),
B = c("a", "a", "b", "b", "b"),
C = c(.5, .5, 2.5, 1.5, 2.5),
D = c(54, 36, 94, 57, 49))
resulting in this dataframe:
A B C D
1 10 a 0.5 54
2 1 a 0.5 36
3 4 b 2.5 94
4 3 b 1.5 57
5 7 b 2.5 49
Notice that rows 1 and 2 are equivalent in B and C, so they should be summed. But row 1 has the greater value in A of the two, so 54 should be kept instead of 36. It's similar with rows 3 and 5. The end result should be:
A B C D
11 a .5 54
3 b 1.5 57
11 b 2.5 49
I am partway there. I have found some code that does part a). Either of these does the trick:
aggregate(A ~ B + C, df, sum)
library(data.table)
setDT(df)[, .(summedvar = sum(A)), by = .(A, B)]
However, these approaches delete Column D, unsurprisingly. I am curious if anyone has any ideas about how to incorporate part b). Maybe I need to do multiple steps? Or maybe I'm going about this the wrong way? I’m deeply grateful for any advice.
You can use which.max
to get the index of highest value in A
and get the corresponding value of D
.
Using dplyr
you may do -
library(dplyr)
df %>%
group_by(B, C) %>%
summarise(D = D[which.max(A)],
A = sum(A), .groups = "drop") %>%
select(A, B, C, D)
# A B C D
# <dbl> <chr> <dbl> <dbl>
#1 11 a 0.5 54
#2 3 b 1.5 57
#3 11 b 2.5 49
Similarly in data.table
-
library(data.table)
setDT(df)
df[, .(A = sum(A), D = D[which.max(A)]), .(B, C)]